Posted to users@spamassassin.apache.org by Amir 'CG' Caspi <ce...@3phase.com> on 2013/06/14 01:05:43 UTC

New rule for HTML spam, using comments?

Lately, I've been getting hit with a LOT of this type of spam:

http://pastebin.com/HD0rNdxU

Not all of it is identical in format, but there seems to be one thing 
in common: they include lots of random garbage inside either CSS or 
in HTML comments.  All of this gets ignored by the HTML parser and 
doesn't display, but is nevertheless in the raw source.  The example 
above includes both types: non-parsing garbage in the CSS header, and 
an HTML comment at the end.

I wonder, can a rule be created that basically looks for incredibly 
long HTML comments (like, multi-KB length comments), and/or looks in 
the CSS for long sequences of garbage?  The former should be 
relatively easy with a regexp; the latter would likely require a 
syntax check to see whether something was valid markup or not.

I'm fairly suspicious that these long strings of garbage are intended 
to try to confuse the Bayesian analysis... not sure if it works as 
intended, but my SA is very clearly missing the spams that I am 
noticing, and on occasion (about 10% of the time) is mistakenly 
autolearning it as ham.  I've been running all these missed messages 
through sa-learn but I am unsure if it's been helping much... hence, 
I am wondering whether a rule might work to increase the spam score 
on these.  I strongly doubt that legitimate email would include 
multi-KB strings of garbage inside HTML comments, so this could be a 
fairly decent spam determiner.
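
Something like the following is roughly what I have in mind (I'm new to
writing SA rules, so treat this as pseudocode more than a working rule;
the name, threshold and score are all made up and untested):

   # flag a long run of non-markup text right after an HTML comment opens
   rawbody  LOCAL_HUGE_COMMENT  /<!--[^>]{2048}/
   describe LOCAL_HUGE_COMMENT  Very long run of text inside an HTML comment
   score    LOCAL_HUGE_COMMENT  0.5

It would need checking against a ham corpus first, and the threshold may
need tuning depending on how SA chunks the raw body text it hands to
rawbody rules.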

Thoughts?

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Tue, 18 Jun 2013, Axb wrote:

> On 06/18/2013 07:24 PM, John Hardin wrote:
>>  On Tue, 18 Jun 2013, Amir 'CG' Caspi wrote:
>> 
>> >  At 10:13 AM -0700 06/18/2013, John Hardin wrote:
>> > >  On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>> > > >  Any idea why it failed to hit, and does this need another rule
>> > >  revision?
>> > > 
>> > >  Yep, and yep. Revision committed. Initial comment gibberish rule
>> > >  committed.
>> > 
>> >  Thanks for the revision.  Do you want to explain why it failed and how
>> >  you fixed it? =)
>>
>>  The earlier version wasn't allowing for some punctuation in the
>>  gibberish. There may be a period of whack-a-mole here, I was
>>  conservative in the change I made.
>
> hope this is nopublish atm
> rules like this can be a performance hog.

Yes, they can be. These rules avoid unbounded repetition, alternation and 
situations where backtracking is likely to occur, and are firmly anchored, 
so I think they are not likely to exhibit performance problems.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   So Microsoft's invented the ASCII equivalent to ugly ink spots that
   appear on your letter when your pen is malfunctioning.
          -- Greg Andrews, about Microsoft's way to encode apostrophes
-----------------------------------------------------------------------
  Today: SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Axb <ax...@gmail.com>.
On 06/18/2013 07:24 PM, John Hardin wrote:
> On Tue, 18 Jun 2013, Amir 'CG' Caspi wrote:
>
>> At 10:13 AM -0700 06/18/2013, John Hardin wrote:
>>> On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>>> > Any idea why it failed to hit, and does this need another rule
>>> revision?
>>>
>>> Yep, and yep. Revision committed. Initial comment gibberish rule
>>> committed.
>>
>> Thanks for the revision.  Do you want to explain why it failed and how
>> you fixed it? =)
>
> The earlier version wasn't allowing for some punctuation in the
> gibberish. There may be a period of whack-a-mole here, I was
> conservative in the change I made.
>

hope this is nopublish atm
rules like this can be a performance hog.



RE: New rule for HTML spam, using comments?

Posted by "emailitis.com" <in...@emailitis.com>.
"Now I just have to figure out my Bayes problem..."

Amir,  When you do work that out, please let us know.  We get LOTS of Spam
getting through and John said that it is the BAYES_00 which is causing the
problem.  Restarting training seems a bit extreme.  We cannot monitor every
hosted user, obviously.  We can find patterns in our maillog and I would
love to know more about sa-learn.

Where I personally have some messages which I have moved via webmail into the
Spam folder, can I run this command:
sa-learn --spam /var/qmail/mailnames/domain.com/user-1/Maildir/.Spam/cur

If we start from scratch because of the large number of false positives, is
there a best practice way that we can monitor the maillog and correct any
false positives or false negatives?  Clearly we cannot watch every email so
some will naturally get through.  I'd love to know the views of others because my
database is definitely not reporting accurately.

Can I do:
sa-learn --backup
/var/qmail/mailnames/expat-email.com/kuhle/.spamassassin/bayes_toks
and will it save the STDOUT file in the same folder?  
And if so, can I then open that file and get anything useful from it?
http://spamassassin.apache.org/full/3.3.x/doc/sa-learn.txt suggests I can:

 "   --backup
        Performs a dump of the Bayes database in machine/human readable
format.
        The dump will include token and seen data. It is suitable for input
back into the --restore command."
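
In other words, since --backup writes its dump to STDOUT, I assume I would
redirect it to a file myself, something like (paths made up):

sa-learn --backup > /path/to/bayes-backup.txt
sa-learn --restore /path/to/bayes-backup.txt

and the dump should be readable as plain text, although as I understand it
the tokens themselves are stored as hashes rather than the original words.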


Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 10:24 AM -0700 06/18/2013, John Hardin wrote:
>The earlier version wasn't allowing for some punctuation in the 
>gibberish. There may be a period of whack-a-mole here, I was 
>conservative in the change I made.

Makes sense.  Both of those examples are good for creating an 
HTML_COMMENT_GIBBERISH rule, by the way. ;-)

Now I just have to figure out my Bayes problem...

Thanks again.
						--- Amir

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Tue, 18 Jun 2013, Amir 'CG' Caspi wrote:

> At 10:13 AM -0700 06/18/2013, John Hardin wrote:
>> On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>> > Any idea why it failed to hit, and does this need another rule revision?
>> 
>> Yep, and yep. Revision committed. Initial comment gibberish rule committed.
>
> Thanks for the revision.  Do you want to explain why it failed and how you 
> fixed it? =)

The earlier version wasn't allowing for some punctuation in the gibberish. 
There may be a period of whack-a-mole here, I was conservative in the 
change I made.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   So Microsoft's invented the ASCII equivalent to ugly ink spots that
   appear on your letter when your pen is malfunctioning.
          -- Greg Andrews, about Microsoft's way to encode apostrophes
-----------------------------------------------------------------------
  Today: SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 10:13 AM -0700 06/18/2013, John Hardin wrote:
>On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>>Any idea why it failed to hit, and does this need another rule revision?
>
>Yep, and yep. Revision committed. Initial comment gibberish rule committed.

Thanks for the revision.  Do you want to explain why it failed and 
how you fixed it? =)

Thanks.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:

> At 10:48 AM -0700 06/17/2013, John Hardin wrote:
>> On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>> 
>> > I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past day 
>> > or so, since the new rules hit the distribution.  So far, all TPs, no 
>> > FPs.
>> 
>> Yay!
>
> But, I found one today that should have hit (at least on cursory inspection) 
> and did not.  See http://pastebin.com/Zswg77Ds
>
> There is definitely style gibberish there, but it didn't hit that rule. 
> (Yes, it also hit bayes00, I know... don't bring that up. =P)
>
> Any idea why it failed to hit, and does this need another rule revision?

Yep, and yep. Revision committed. Initial comment gibberish rule 
committed.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   If the rock of doom requires a gentle nudge away from Gaia to
   prevent a very bad day for Earthlings, NASA won’t be riding to the
   rescue. These days, NASA does dodgy weather research and outreach
   programs, not stuff in actual space with rockets piloted by
   flinty-eyed men called Buzz.                       -- Daily Bayonet
-----------------------------------------------------------------------
  Today: SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Amir Caspi <ce...@3phase.com>.
On Mon, June 17, 2013 11:48 am, John Hardin wrote:
> Well, that's a much harder problem. STYLE tags have a specified format,
> and content not matching that format is (fairly) easy to detect. Comments
> are freeform text - "gibberish" has the same meaning there that it does in
> regular body text.
>
> It's *possible* that converting the __LONGWORDS rules from body to rawbody
> and making them multiline would be justified, but there would have to be
> some discussion about that. They are at present unbounded and doing that
> conversion blindly could be Very Bad.
>
> Perhaps a better approach would be to modify the HTML parser plugin to
> support rules regarding the size of HTML comments. This also could be done
> in a rawbody rule, but the size of comments may not be a useful spam sign.

All of the HTML comment garbage I've seen would explicitly match something
like:
<!-- ([A-Za-z0-9.+-]+[/\n ]){300,} -->

That is, "word" characters with some punctuation, generally
space-delimited (though I've seen some that are slash-delimited), with
lengths of 300 words or more.  Newlines are included in the delimiter
class to allow for splitting over multiple lines.  Obviously, this won't
catch them all, but it should catch most of the comment garbage, I think. 
(I can look through my FNs to see if there are any other potential
patterns.)

I have received a few multi-part spams in the past few weeks, where the
message is (I guess) too long to pass through the MTA... not sure if it
gets split by my mail server or somewhere upstream.  In those cases, the
ending portion of the comment is in part 2 or 3 of the email... I don't
know if spamd runs on the entire email before it gets split, or on the
individual pieces.  If the latter, one could consider two rules, one which
matches whole comments, one which matches either beginning or end (and the
middle content would then have to be correspondingly larger to accommodate
the fact that this is a split message and thus huge comment).

For what it's worth, I think the size of the comments could well be a good
rule.  I can look through my ham but I'm pretty sure that none of it has
enormous comments like the spam does.  These comments contain multiply
kilobytes of text.  I have never seen a ham email that contains multiple
KB of commented material that includes hundreds, sometimes thousands of
words.

Obviously I understand the problem with FPs and the potential disaster of
creating the rule badly.  I think if you require something like 300+ words
within the comment, that would be sufficient to rule out basically every
ham.  You could also give it a relatively low score like 1.5, so that it
adds to spamminess without forcing spam=yes on messages that truly are
ham.
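
To make that concrete, the sort of thing I have in mind (completely
untested; rule name, threshold and score are placeholders, and it would
need mass-check results before it got a real score) would be roughly:

   # ~300 space/slash/newline-delimited "words" inside an HTML comment
   rawbody  LOCAL_COMMENT_GIBBERISH  /<!--\s*(?:[A-Za-z0-9.+-]+[\/\n ]+){300}/
   describe LOCAL_COMMENT_GIBBERISH  Hundreds of delimited words inside an HTML comment
   score    LOCAL_COMMENT_GIBBERISH  1.5

The bounded {300} repetition and the anchoring on "<!--" should keep the
backtracking risk down, though whether the rawbody text is handed to the
rule in big enough chunks is something that would need testing.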

Thanks. =)

						--- Amir



Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

>>>> I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past day
>>>> or so, since the new rules hit the distribution.  So far, all TPs, no
>>>> FPs.
>>>
>>>
>>> Yay!
>>
>>
>> I've also noticed the latest iteration hitting now quite a bit, but
>> also found an FP from groupon:
>>
>> http://pastebin.com/qwdtSqJd
>
> Well, that *is* gibberish in a STYLE tag. Bad coder, no biscuit.
>
> If it persists I can add an exclusion for mail from groupon.com

Yeah, no doubt. I'll add a groupon exception locally for now, and will
let you know if we find any others.

Thanks,
Alex

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Mon, Jun 17, 2013 at 10:39 PM, Benny Pedersen <me...@junc.eu> wrote:
> John Hardin skrev den 2013-06-17 20:52:
>
>>> http://pastebin.com/qwdtSqJd
>>
>> Well, that *is* gibberish in a STYLE tag. Bad coder, no biscuit.
>>
>> If it persists I can add an exclusion for mail from groupon.com
>
> Content analysis details:   (-2.4 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
> -0.7 RCVD_IN_DNSWL_LOW      RBL: Sender listed at http://www.dnswl.org/, low
>                             trust
>                             [50.115.211.238 listed in list.dnswl.org]
> -1.1 RP_MATCHES_RCVD        Envelope sender domain matches handover relay
>                             domain
>  0.0 HTML_IMAGE_RATIO_06    BODY: HTML has a low ratio of text to image area
>  0.0 HTML_MESSAGE           BODY: HTML included in message
>  1.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
> -2.0 RCVD_IN_RP_SAFE        RBL: Sender in ReturnPath Safe - Contact
>                             safe-sa@returnpath.net
>                             [Return Path SenderScore Safe List (formerly
>                             Habeas Safelist) - <http://www.senderscorecertified.com>]
> -3.0 RCVD_IN_RP_CERTIFIED   RBL: Sender in ReturnPath Certified - Contact
>                             cert-sa@returnpath.net
>                             [Return Path SenderScore Certified {formerly
>                             Bonded Sender} - <http://www.senderscorecertified.com>]
>  0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not
>                             necessarily valid
>  0.0 T_DKIM_INVALID         DKIM-Signature header exists but is not valid
>  3.2 STYLE_GIBBERISH        Nonsense in HTML <STYLE> tag
>
> does it need more whitelisting?

Just curious -- no bayes for you? You're subtracting a lot more for
RP_SAFE than I am.

> why is dkim not valid ?

Perhaps because I've manipulated the real sender to be 'example'? Mine
is signed properly here:

X-Spam-Status: No, score=0.822 tagged_above=-200 required=5
        tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
        DKIM_VALID_AU=-0.1,

Thanks,
Alex

Re: New rule for HTML spam, using comments?

Posted by Benny Pedersen <me...@junc.eu>.
John Hardin skrev den 2013-06-17 20:52:

>> http://pastebin.com/qwdtSqJd
>
> Well, that *is* gibberish in a STYLE tag. Bad coder, no biscuit.
>
> If it persists I can add an exclusion for mail from groupon.com

Content analysis details:   (-2.4 points, 5.0 required)

  pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.7 RCVD_IN_DNSWL_LOW      RBL: Sender listed at http://www.dnswl.org/, low
                             trust
                             [50.115.211.238 listed in list.dnswl.org]
 -1.1 RP_MATCHES_RCVD        Envelope sender domain matches handover relay
                             domain
  0.0 HTML_IMAGE_RATIO_06    BODY: HTML has a low ratio of text to image area
  0.0 HTML_MESSAGE           BODY: HTML included in message
  1.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 -2.0 RCVD_IN_RP_SAFE        RBL: Sender in ReturnPath Safe - Contact
                             safe-sa@returnpath.net
                             [Return Path SenderScore Safe List (formerly
                             Habeas Safelist) - <http://www.senderscorecertified.com>]
 -3.0 RCVD_IN_RP_CERTIFIED   RBL: Sender in ReturnPath Certified - Contact
                             cert-sa@returnpath.net
                             [Return Path SenderScore Certified {formerly
                             Bonded Sender} - <http://www.senderscorecertified.com>]
  0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not
                             necessarily valid
  0.0 T_DKIM_INVALID         DKIM-Signature header exists but is not valid
  3.2 STYLE_GIBBERISH        Nonsense in HTML <STYLE> tag

does it need more whitelisting?

why is dkim not valid ?

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 17 Jun 2013, Alex wrote:

> Hi,
>
> On Mon, Jun 17, 2013 at 1:48 PM, John Hardin <jh...@impsec.org> wrote:
>> On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>>
>>> At 7:20 PM -0700 06/15/2013, John Hardin wrote:
>>>>
>>>> I took a closer look at this and it seems they're working around trivial
>>>> gibberish detection by putting a valid CSS property at the very beginning of
>>>> the style tag.
>>>>
>>>> Revising the rules...
>>>
>>> I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past day
>>> or so, since the new rules hit the distribution.  So far, all TPs, no FPs.
>>
>> Yay!
>
> I've also noticed the latest iteration hitting now quite a bit, but
> also found an FP from groupon:
>
> http://pastebin.com/qwdtSqJd

Well, that *is* gibberish in a STYLE tag. Bad coder, no biscuit.

If it persists I can add an exclusion for mail from groupon.com

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Liberals love sex ed because it teaches kids to be safe around their
   sex organs. Conservatives love gun education because it teaches kids
   to be safe around guns. However, both believe that the other's
   education goals lead to dangers too terrible to contemplate.
-----------------------------------------------------------------------
  Tomorrow: SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Mon, Jun 17, 2013 at 1:48 PM, John Hardin <jh...@impsec.org> wrote:
> On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>
>> At 7:20 PM -0700 06/15/2013, John Hardin wrote:
>>>
>>> I took a closer look at this and it seems they're working around trivial
>>> gibberish detection by putting a valid CSS property at the very beginning of
>>> the style tag.
>>>
>>> Revising the rules...
>>
>> I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past day
>> or so, since the new rules hit the distribution.  So far, all TPs, no FPs.
>
> Yay!

I've also noticed the latest iteration hitting now quite a bit, but
also found an FP from groupon:

http://pastebin.com/qwdtSqJd

Thanks,
Alex

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 10:48 AM -0700 06/17/2013, John Hardin wrote:
>On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:
>
>>I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the 
>>past day or so, since the new rules hit the distribution.  So far, 
>>all TPs, no FPs.
>
>Yay!

But, I found one today that should have hit (at least on cursory 
inspection) and did not.  See http://pastebin.com/Zswg77Ds

There is definitely style gibberish there, but it didn't hit that 
rule.  (Yes, it also hit bayes00, I know... don't bring that up. =P)

Any idea why it failed to hit, and does this need another rule revision?

You can also see the HTML comment spam here, and in my previous 
example, which could be a basis for an HTML_COMMENT_GIBBERISH rule, 
per the earlier emails.

Cheers.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-20 11:13:

> BTW, I'm not talking about _actually_ reversing MailScanner's
> "protection."  I'm talking about SA understanding enough to unmunge
> the URI **for SA processing only**.  The actual mail delivered to the
> end-user would remain munged.  SA would not be reversing anything, it
> would simply have sufficient know-how to parse what MailScanner had
> done.

is it not possible to run spamassassin first, and then have mailscanner
do the munging afterwards?

e.g. if mailscanner is brain-dead by design, call it once with munging
enabled, and one more time with spamassassin only; that way you can
control the order of the munging, and you will get better results than
calling mailscanner the first time, hmm :)

i think it's just the order of what you do that's wrong, not what you
want the end result to be

> I apologize for getting everyone's gander up.  I'll drop this
> MailScanner subject and return to my hole.

then no one can help you :)


-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 9:47 AM +0200 06/20/2013, Tom Hendrikx wrote:
>Since mailscanner already has support for integrating spamassassin [1]

(As I mentioned explicitly in a previous email...)

>why would you ever want to put work in reversing some of mailscanners
>'protection'?

Because, given the particulars of my system setup, I don't want to 
have MailScanner running spamassassin -- I need them to be 
independent.  Hence, if I cared enough about having the web bugs pop 
in SA, I would want SA to "reverse" the MailScanner effects, not to 
have MailScanner run SA.

BTW, I'm not talking about _actually_ reversing MailScanner's 
"protection."  I'm talking about SA understanding enough to unmunge 
the URI **for SA processing only**.  The actual mail delivered to the 
end-user would remain munged.  SA would not be reversing anything, it 
would simply have sufficient know-how to parse what MailScanner had 
done.

>or disable the url munging in mailscanner?

I don't want to disable the URI munging for the reasons I outlined in 
a previous email: in short, I don't want end-users interacting with 
spam-increasing web bugs.

>For the result that you want to achieve (get protection from both
>filters), your proposed solution seems to be the hardest way to success.

I will disagree with you on this point, mostly because you don't know 
the particulars of my system setup and therefore all of the 
"difficulties" that would arise with implementing a different 
solution.

Let me be clear: I'm perfectly satisfied with doing nothing about 
unmunging MailScanner.  As I mentioned in a previous email, in every 
case where a web bug gets munged, that same domain is linked 
elsewhere in the email and is inevitably processed by SA (I won't say 
"caught" since that requires the domain to be blacklisted, which may 
not be the case when processed).  Thus, I don't actually believe that 
this web bug munging is causing significant harm to SA's processing.

But, Axb's responses indicated that he/she seemed to think it would 
be an issue.  Thus, I suggested that a plugin could be written to 
"unmunge" the munged web bug.  (Again, this would be purely for SA 
internal processing, and the delivered mail would retain 
MailScanner's "protection.")  The plugin COULD be as simple as taking 
MailScanner's default format, processing it through a regexp to pick 
the original URI out of the img tag's alt attribute, and passing that 
to the URIBL plugin (which seems pretty simple to me).  Or, it could 
be more complex to take into account potential user configurations of 
MailScanner, and thus would read the MailScanner config files.
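
To make the "simple" version concrete: assuming MailScanner's default
web-bug replacement (the mailscanner.tv 1x1spacer.gif image, with the
original URL preserved in the img tag's alt attribute), the regexp the
plugin would apply is roughly of this shape (illustrative only; attribute
order and quoting vary, so a real implementation would need to be more
forgiving):

<img\b[^>]*\bsrc="https?://[^"]*1x1spacer\.gif"[^>]*\balt="(https?://[^"]+)"

with the captured group being the original URI that would then be handed
to the URIBL checks.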

Personally, I don't think my proposed solution is all that complex, 
and certainly no more complex than trying to figure out how to get 
MailScanner to play properly with SA given my particular virtual 
hosting setup.  On the other hand, as I mentioned 2 paragraphs up, 
I'm also OK with doing nothing, which is certainly the easiest and 
least error-prone solution.

So, let's all just realize that this was primarily a thought 
experiment, nobody seems interested in following or implementing it, 
and (particularly at this point) I don't think it's worth it any more.

I apologize for getting everyone's gander up.  I'll drop this 
MailScanner subject and return to my hole.

Cheers.
						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Tom Hendrikx <to...@whyscream.net>.
On 06/20/2013 01:34 AM, Amir 'CG' Caspi wrote:
> On Wed, June 19, 2013 3:47 pm, Axb wrote:
>> SA's URIBL plugin doesn't and shouldn't look in the alt attribute.
> 
> Why not, exactly?  I wouldn't look at it for _all_ img tags, only for ones
> that are clearly MailScanner-munged.  That is, one would look for the
> patterns that MailScanner uses for munging, and if detected, pull out the
> original URI from the alt attribute.  I admit to being new to the SA game
> but I'm not understanding why that "shouldn't" happen, i.e. why it's bad,
> against form, insecure, etc.
> 
> Now, MailScanner's munging format is, IIRC, user-configurable.  Therefore,
> there may not be a fully universal munged format (although there is
> certainly a "default" format).  So, one way to glue this to MailScanner is
> to have SA load the MailScanner config, figure out what the munged format
> is from that, and use that as the rule for whether or not to look in the
> alt attribute.  If MailScanner is not installed or one does not want to
> glue them together, then one would use the default format.  And, of
> course, this could be completely user-toggleable, i.e. one could choose
> whether to unmunge MailScanner tags, or leave them as-is (i.e. what
> currently happens).
> 
> Also, I should clarify that I wasn't advocating for a modification to the
> URIBL plugin, but rather the creation of a NEW plugin that would unmunge
> MailScanner URIs.  This plugin would pre-process the mail prior to the
> URIBL and Bayes analysis, to return the mail to its "original" state
> before MailScanner munged it.  If that's not possible due to how SA
> plugins work (i.e. if you can't specify the order of plugins being run)
> then it could simply run alongside URIBL as a "Mailscanner-unmunged URIBL"
> ...
> 
> In any case, I guess I don't see why this isn't possible or not
> recommended.  I only see that nobody has done it, but I don't see that it
> shouldn't be done.
> 

Since mailscanner already has support for integrating spamassassin [1],
why would you ever want to put work in reversing some of mailscanners
'protection'? Why don't you try the integration docs first, change the
processing order (i.e. process the mail with spamassassin first, then
with mailscanner), or disable the url munging in mailscanner?

For the result that you want to achieve (get protection from both
filters), your proposed solution seems to be the hardest way to success.
Not to mention probably the most error prone, or involving large amounts
of labor. Your proposal is not by definition impossible or plain stupid
(can't judge on either of those), it's just that there are many
reasons to try other routes before going down that road...

[1] http://www.mailscanner.info/spamassassin.html

Regards,
	Tom

Disclaimer: I have never used mailscanner, so I don't claim any
knowledge beyond anything a 2 minute googling session wouldn't turn up.



Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Wed, June 19, 2013 3:47 pm, Axb wrote:
> SA's URIBL plugin doesn't and shouldn't look in the alt attribute.

Why not, exactly?  I wouldn't look at it for _all_ img tags, only for ones
that are clearly MailScanner-munged.  That is, one would look for the
patterns that MailScanner uses for munging, and if detected, pull out the
original URI from the alt attribute.  I admit to being new to the SA game
but I'm not understanding why that "shouldn't" happen, i.e. why it's bad,
against form, insecure, etc.

Now, MailScanner's munging format is, IIRC, user-configurable.  Therefore,
there may not be a fully universal munged format (although there is
certainly a "default" format).  So, one way to glue this to MailScanner is
to have SA load the MailScanner config, figure out what the munged format
is from that, and use that as the rule for whether or not to look in the
alt attribute.  If MailScanner is not installed or one does not want to
glue them together, then one would use the default format.  And, of
course, this could be completely user-toggleable, i.e. one could choose
whether to unmunge MailScanner tags, or leave them as-is (i.e. what
currently happens).

Also, I should clarify that I wasn't advocating for a modification to the
URIBL plugin, but rather the creation of a NEW plugin that would unmunge
MailScanner URIs.  This plugin would pre-process the mail prior to the
URIBL and Bayes analysis, to return the mail to its "original" state
before MailScanner munged it.  If that's not possible due to how SA
plugins work (i.e. if you can't specify the order of plugins being run)
then it could simply run alongside URIBL as a "Mailscanner-unmunged URIBL"
...

In any case, I guess I don't see why this isn't possible or not
recommended.  I only see that nobody has done it, but I don't see that it
shouldn't be done.

Cheers.

						--- Amir


Re: New rule for HTML spam, using comments?

Posted by Axb <ax...@gmail.com>.
On 06/19/2013 11:30 PM, Amir 'CG' Caspi wrote:
>
> Yes, MailScanner gets to it before SA does, unless SA is called from
> within MailScanner (which it isn't, on my setup, but that is a possible
> setup).  However, the complete original URL is still contained within the
> munged one.  It's in the alt attribute of the img tag, as you can see in
> the examples I posted.  Therefore, a crystal ball is hardly necessary...
> just a regexp.  I think that's pretty possible, don't you?

SA's URIBL plugin doesn't and shouldn't look in the alt attribute.
As you have not given any details on how you glue in SA, if not with
MailScanner, it's anybody's guess whether it's possible or even worth it.

>> btw - all these spams have a very obvious trait.
>> Look at the source thoroughly.
>
> Are you referring to the <font size=0%> tag, or the X-nnn header?  Or
> neither one?
>
> Either way, by "all" do you mean the two I just posted?  That's hardly
> "all."  There are many examples of style-gibberish spam that do not share
> either of the above traits.

the two samples you supplied do indeed include these but there may be 
more. (God, I miss the SARE pattern hunting contests .-)


Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Wed, June 19, 2013 3:14 pm, Axb wrote:
> iirc, MailScanner munges the URL before SA sees it so unless your plugin
> idea involves a crystal ball, it's not possible.

Yes, MailScanner gets to it before SA does, unless SA is called from
within MailScanner (which it isn't, on my setup, but that is a possible
setup).  However, the complete original URL is still contained within the
munged one.  It's in the alt attribute of the img tag, as you can see in
the examples I posted.  Therefore, a crystal ball is hardly necessary...
just a regexp.  I think that's pretty possible, don't you?

> btw - all these spams have a very obvious trait.
> Look at the source thoroughly.

Are you referring to the <font size=0%> tag, or the X-nnn header?  Or
neither one?

Either way, by "all" do you mean the two I just posted?  That's hardly
"all."  There are many examples of style-gibberish spam that do not share
either of the above traits.

Thanks.

						--- Amir


Re: New rule for HTML spam, using comments?

Posted by Axb <ax...@gmail.com>.
On 06/19/2013 10:54 PM, Amir Caspi wrote:
> Perhaps SA should include a module/plugin to "unmunge" MailScanner
> munging?  Has anyone written one, or if not, would anyone like to? ;-)
> (Since MailScanner is open-source perl, I imagine it should be relatively
> straightforward to find the munging code, write the reverse of it, and
> make that an SA plugin... I'm not sufficiently experienced to do it at the
> moment, but maybe someone else is interested.)

iirc, MailScanner munges the URL before SA sees it so unless your plugin 
idea involves a crystal ball, it's not possible.

btw - all these spams have a very obvious trait.
Look at the source thoroughly.


Re: New rule for HTML spam, using comments?

Posted by Amir Caspi <ce...@3phase.com>.
On Wed, June 19, 2013 2:33 pm, Axb wrote:
> imo, it makes little sense to write rules to catch these hashbusters. As

If the rule is sufficiently broad, it will catch them.  If the rule is so
strict that it catches only one trailing slash or something, then yes, it
makes little sense... but I think it should be possible to write the rule
to be sufficiently generic.  I'm hoping John is trying to be as generic as
possible (while obviously minimizing FPs).  Basically, look for long
strings of stuff that cannot possibly be a valid HTML or CSS tag... if
it's there, consider it gibberish and spammy.  There are known regexps for
valid HTML/CSS markup; the rule could, in principle, simply match on the
negation of those regexps, with sufficient repetition.  (This is the same
reason why I think we need an HTML comment gibberish rule, and how it
could be implemented.)
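
As a crude illustration of the "negation" idea (just the shape of it, not
an actual tested rule): a long stretch inside a <style> tag containing
none of the characters a real CSS declaration needs could be matched with
something like

<style[^>]*>[^:;{}<]{500}

i.e. 500+ consecutive characters after the opening tag with no colon,
semicolon, brace or tag start, which ordinary CSS is very unlikely to
produce.  Per John's earlier note, the spammers now prepend one valid
property, so a real rule would also have to tolerate some legitimate CSS
up front.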

> I'd suggest you disable MailScanner's remote img munging - this is so
> 2004... (MUAS block remote images anyway)

Mail clients only block remote images if they are set to do so.  While
this may be the default setting on most clients, it's not the default on
all, and it can be overridden by the user (globally or on a per-message
basis).  Web bugs embedded in an email server only one purpose: to verify
that an email has been read.  For legitimate emails, they're basically
innocuous; for spam, they are potentially harmful since they verify the
spam recipient address as valid.

I don't want _ANY_ of my users interacting with web bugs, whether because
they deliberately turned on the "view remote images" feature of their
client, either globally or even just for single message (but they most
probably don't understand that this exposes web bugs), or because that
feature was somehow enabled by default and they don't know enough to turn
it off.  Either way, I don't want the web bugs followed, and hence I
prefer to retain this (perhaps outdated but IMHO still useful) feature of
MailScanner.

> the image URL may contain a listed domain and you'll miss it.

You're right that SA may miss it, but in my experience, the spam body
typically contains that same domain in (often many) other links or image
tags (not web bugs, meaning MailScanner won't munge them), so SA will
usually pick it up anyway.

Perhaps SA should include a module/plugin to "unmunge" MailScanner
munging?  Has anyone written one, or if not, would anyone like to? ;-) 
(Since MailScanner is open-source perl, I imagine it should be relatively
straightforward to find the munging code, write the reverse of it, and
make that an SA plugin... I'm not sufficiently experienced to do it at the
moment, but maybe someone else is interested.)

> As this is applied to ham as well as spam, your bayes will learn
> mailscanner.tv as spam AND ham making it harder to be effective.

In other words, the munging won't have any effect on the Bayes DB since
it's applied to both ham and spam.  So, I don't quite see the problem.  If
I remove munging, it has no effect on spam or ham... if I retain it, it
has basically no effect on spam or ham.  So, Bayes will pretty much just
ignore that token.  But, per above, the same domain is generally mentioned
elsewhere in the message, so the appropriate token should still get picked
up.

As above, I prefer to retain this feature to prevent any interaction with
web bugs, since mail clients CAN load remote images (on purpose or not).

> Are you using RAZOR? if not, it may be time to deploy.

Yes, I am using both Razor and Pyzor.  Both of them are getting positive
hits on a lot of received spam (Razor more often than Pyzor, but both do
hit).

Thanks.

						--- Amir


Re: New rule for HTML spam, using comments?

Posted by Axb <ax...@gmail.com>.
On 06/19/2013 10:11 PM, cepheid@3phase.com wrote:
> Hi John,
>
> See the following example:
>
> http://pastebin.com/DAYJ7NnJ
>
> Lots of style gibberish for sure, but it failed to hit your rule
> (sa-update ran at 4am today so it should have picked up anything
> published).  I'm guessing it's the parentheses.
>
> Whack the mole! =)
>
>                          --- Amir
>
> p.s. On the upside, at least it hit bayes99! ;-)

imo, it makes little sense to write rules to catch these hashbusters. As 
soon as a rule is shown or published, the pattern will be changed and 
the rat race continues.

I'd suggest you disable MailScanner's remote img munging - this is so 
2004... (MUAs block remote images anyway)

http://www.mailscanner.tv/1x1spacer.gif...........

the image URL may contain a listed domain and you'll miss it.
As this is applied to ham as well as spam, your bayes will learn 
mailscanner.tv as spam AND ham making it harder to be effective.

Are you using RAZOR? if not, it may be time to deploy.




Re: New rule for HTML spam, using comments?

Posted by Benny Pedersen <me...@junc.eu>.
cepheid@3phase.com skrev den 2013-06-19 22:11:
> Hi John,
>
> See the following example:
>
> http://pastebin.com/DAYJ7NnJ
>
> Lots of style gibberish for sure, but it failed to hit your rule
> (sa-update ran at 4am today so it should have picked up anything
> published).  I'm guessing it's the parentheses.
>
> Whack the mole! =)

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 1 - Warning: missing </title> before <style>
line 297 column 5 - Warning: '<' + '/' + letter not allowed here
line 299 column 1 - Warning: discarding unexpected </title>
line 305 column 664 - Warning: <a> discarding newline in URI reference
line 308 column 619 - Warning: <img> discarding newline in URI reference
line 309 column 933 - Warning: <img> unexpected or duplicate quote mark
line 309 column 933 - Warning: <img> attribute with missing trailing quote mark
line 5 column 1 - Warning: <style> inserting "type" attribute
line 305 column 664 - Warning: <a> escaping malformed URI reference
line 307 column 70 - Warning: <b> proprietary attribute "r"
line 308 column 619 - Warning: <img> escaping malformed URI reference
line 308 column 619 - Warning: <img> lacks "alt" attribute
line 309 column 933 - Warning: <img> attribute "width" has invalid value "!"
line 319 column 40 - Warning: <img> lacks "alt" attribute
line 307 column 70 - Warning: trimming empty <b>
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
16 warnings, 0 errors were found!

in gentoo:

emerge -av htmltidy ripmime

ripmime -i msg -d /tmp/
tidy -o textfile0.html -f textfile0.err /tmp/textfile0

see textfile0.err now

the day where i see 0 warnings i will raise the flag in my garden :)

Re: New rule for HTML spam, using comments?

Posted by Amir Caspi <ce...@3phase.com>.
Another, nearly identical example I saw today, but which used trailing
slashes (/ or //) instead of parentheses.

http://pastebin.com/6XRwcjm3

Enjoy. =)

						--- Amir

On Wed, June 19, 2013 2:11 pm, cepheid@3phase.com wrote:
> Hi John,
>
> See the following example:
>
> http://pastebin.com/DAYJ7NnJ
>
> Lots of style gibberish for sure, but it failed to hit your rule
> (sa-update ran at 4am today so it should have picked up anything
> published).  I'm guessing it's the parentheses.
>
> Whack the mole! =)
>
> 						--- Amir
>
> p.s. On the upside, at least it hit bayes99! ;-)
>



Re: New rule for HTML spam, using comments?

Posted by ce...@3phase.com.
Hi John,

See the following example:

http://pastebin.com/DAYJ7NnJ

Lots of style gibberish for sure, but it failed to hit your rule 
(sa-update ran at 4am today so it should have picked up anything 
published).  I'm guessing it's the parentheses.

Whack the mole! =)

						--- Amir

p.s. On the upside, at least it hit bayes99! ;-)

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 17 Jun 2013, Amir 'CG' Caspi wrote:

> At 7:20 PM -0700 06/15/2013, John Hardin wrote:
>> I took a closer look at this and it seems they're working around trivial 
>> gibberish detection by putting a valid CSS property at the very beginning 
>> of the style tag.
>> 
>> Revising the rules...
>
> I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past day or 
> so, since the new rules hit the distribution.  So far, all TPs, no FPs.

Yay!

> Would you be willing to create an HTML_COMMENT_GIBBERISH rule, which 
> would be very similar to this one, but which looks for long strings of 
> gibberish inside HTML comments?  (That is, <!-- gibberish -->). A 
> number of FN spams that leak through are using gibberish comments 
> without gibberish styles.  I would imagine detecting this should be 
> quite similar to detecting style gibberish...

Well, that's a much harder problem. STYLE tags have a specified format, 
and content not matching that format is (fairly) easy to detect. Comments 
are freeform text - "gibberish" has the same meaning there that it does in 
regular body text.

It's *possible* that converting the __LONGWORDS rules from body to rawbody 
and making them multiline would be justified, but there would have to be 
some discussion about that. They are at present unbounded and doing that 
conversion blindly could be Very Bad.

Perhaps a better approach would be to modify the HTML parser plugin to 
support rules regarding the size of HTML comments. This also could be done 
in a rawbody rule, but the size of comments may not be a useful spam sign.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Activist: Someone who gets involved.
   Unregistered Lobbyist: Someone who gets involved with something
     the MSM doesn't approve of.                           -- WizardPC
-----------------------------------------------------------------------
  Tomorrow: SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 7:20 PM -0700 06/15/2013, John Hardin wrote:
>I took a closer look at this and it seems they're working around 
>trivial gibberish detection by putting a valid CSS property at the 
>very beginning of the style tag.
>
>Revising the rules...

I am now seeing STYLE_GIBBERISH hitting on a lot of spam in the past 
day or so, since the new rules hit the distribution.  So far, all 
TPs, no FPs.

Would you be willing to create an HTML_COMMENT_GIBBERISH rule, which 
would be very similar to this one, but which looks for long strings 
of gibberish inside HTML comments?  (That is, <!-- gibberish -->). 
A number of FN spams that leak through are using gibberish comments 
without gibberish styles.  I would imagine detecting this should be 
quite similar to detecting style gibberish...

I could provide one or more examples if you need.

Thanks in advance. =)

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Fri, 14 Jun 2013, Alex wrote:

>>>> http://ruleqa.spamassassin.org/20130613-r1492572-n/STYLE_GIBBERISH/detail
>>>
>>> John, I've just tried with your latest, and his sample doesn't hit
>>> STYLE_GIBBERISH. Any suggestions?
>>
>> Hmm. I created an HTML message with a series of words in the style tag and
>> it did hit. I'll try it on your sample directly.
>
> It still doesn't hit.

I took a closer look at this and it seems they're working around trivial 
gibberish detection by putting a valid CSS property at the very beginning 
of the style tag.

Revising the rules...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Gun Control laws aren't enacted to control guns, they are enacted
   to control people: catholics (1500s), japanese peasants (1600s),
   blacks (1860s), italian immigrants (1911), the irish (1920s),
   jews (1930s), blacks (1960s), the poor (always)
-----------------------------------------------------------------------
  3 days until SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Benny Pedersen <me...@junc.eu>.
Alex skrev den 2013-06-14 19:57:

> http://pastebin.com/P3mQbwmH

ripmime -i msg -d /tmp
tidy -o html -f error textfile0

gives me this error file content:

line 7 column 1 - Warning: inserting implicit <body>
line 8 column 1 - Warning: discarding unexpected <body>
line 12 column 9 - Warning: <style> isn't allowed in <body> elements
line 7 column 1 - Info: <body> previously mentioned
line 60 column 195 - Warning: discarding unexpected </font>
line 63 column 133 - Warning: missing </strong> before </td>
line 97 column 116 - Warning: unescaped & or unknown entity "&these"
line 55 column 1 - Warning: missing </center>
line 58 column 9 - Warning: <table> lacks "summary" attribute
line 89 column 9 - Warning: <img> lacks "alt" attribute
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
9 warnings, 0 errors were found!

can't spamassassin check it, with tidy as a plugin?

using it that way here to create local rules

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Fri, Jun 14, 2013 at 9:51 AM, John Hardin <jh...@impsec.org> wrote:
> On Thu, 13 Jun 2013, Alex wrote:
>
>> Hi,
>>
>> On Thu, Jun 13, 2013 at 9:55 PM, John Hardin <jh...@impsec.org> wrote:
>>>
>>> On Thu, 13 Jun 2013, Amir 'CG' Caspi wrote:
>>>
>>>> Lately, I've been getting hit with a LOT of this type of spam:
>>>>
>>>> http://pastebin.com/HD0rNdxU
>>>
>>>
>>> http://ruleqa.spamassassin.org/20130613-r1492572-n/STYLE_GIBBERISH/detail
>>
>>
>> John, I've just tried with your latest, and his sample doesn't hit
>> STYLE_GIBBERISH. Any suggestions?
>
>
> Hmm. I created an HTML message with a series of words in the style tag and
> it did hit. I'll try it on your sample directly.
>
> There are some FP-reduction exclusions, add a local rule like:
>
>   meta   STYLE_GIBBERISH_RAW   __STYLE_GIBBERISH
>   score  STYLE_GIBBERISH_RAW   0.0001
>
> to see if the FP exclusions are keeping it from hitting for you.

It still doesn't hit. I'm also noticing a different form that appears
to just include the gibberish in the body itself, not surrounded by
style tags. It also confuses bayes and doesn't hit longwords either.

http://pastebin.com/P3mQbwmH

In the long run, I'm not sure how effective this would be anyway. Other
times they just use gibberish news feeds with actual punctuation,
which would prevent these rules from firing anyway.

I'm actually more interested in knowing why this didn't hit bayes99
when many others already do. Is the body gibberish enough to prevent
it from being classified properly?

Thanks,
Alex

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Thu, 13 Jun 2013, Alex wrote:

> Hi,
>
> On Thu, Jun 13, 2013 at 9:55 PM, John Hardin <jh...@impsec.org> wrote:
>> On Thu, 13 Jun 2013, Amir 'CG' Caspi wrote:
>>
>>> Lately, I've been getting hit with a LOT of this type of spam:
>>>
>>> http://pastebin.com/HD0rNdxU
>>
>> http://ruleqa.spamassassin.org/20130613-r1492572-n/STYLE_GIBBERISH/detail
>
> John, I've just tried with your latest, and his sample doesn't hit
> STYLE_GIBBERISH. Any suggestions?

Hmm. I created an HTML message with a series of words in the style tag and 
it did hit. I'll try it on your sample directly.

There are some FP-reduction exclusions, add a local rule like:

   meta   STYLE_GIBBERISH_RAW   __STYLE_GIBBERISH
   score  STYLE_GIBBERISH_RAW   0.0001

to see if the FP exclusions are keeping it from hitting for you.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Democrats '61: Ask not what your country can do for you,
    ask what you can do for your country.
   Democrats '07: Ask not what your country can do for you,
    demand it!
-----------------------------------------------------------------------
  4 days until SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Thu, Jun 13, 2013 at 9:55 PM, John Hardin <jh...@impsec.org> wrote:
> On Thu, 13 Jun 2013, Amir 'CG' Caspi wrote:
>
>> Lately, I've been getting hit with a LOT of this type of spam:
>>
>> http://pastebin.com/HD0rNdxU
>>
>> Not all of it is identical in format, but there seems to be one thing in
>> common: they include lots of random garbage inside either CSS or in HTML
>> comments.  All of this gets ignored by the HTML parser and doesn't display,
>> but is nevertheless in the raw source.  The example above includes both
>> types: non-parsing garbage in the CSS header, and an HTML comment at the
>> end.
>>
>> I wonder, can a rule be created that basically looks for incredibly long
>> HTML comments (like, multi-KB length comments), and/or looks in the CSS for
>> long sequences of garbage?
>
> http://ruleqa.spamassassin.org/20130613-r1492572-n/STYLE_GIBBERISH/detail

John, I've just tried with your latest, and his sample doesn't hit
STYLE_GIBBERISH. Any suggestions?

Also, can you explain which are the relevant percentages on the ruleqa
page that are most useful? Is it the aggregate value, which shows this
rule appears in about 0.0022 percent ham and  0.1895 percent spam?

Thanks,
Alex

Re: New rule for HTML spam, using comments?

Posted by John Hardin <jh...@impsec.org>.
On Thu, 13 Jun 2013, Amir 'CG' Caspi wrote:

> Lately, I've been getting hit with a LOT of this type of spam:
>
> http://pastebin.com/HD0rNdxU
>
> Not all of it is identical in format, but there seems to be one thing in 
> common: they include lots of random garbage inside either CSS or in HTML 
> comments.  All of this gets ignored by the HTML parser and doesn't display, 
> but is nevertheless in the raw source.  The example above includes both 
> types: non-parsing garbage in the CSS header, and an HTML comment at the end.
>
> I wonder, can a rule be created that basically looks for incredibly long HTML 
> comments (like, multi-KB length comments), and/or looks in the CSS for long 
> sequences of garbage?

http://ruleqa.spamassassin.org/20130613-r1492572-n/STYLE_GIBBERISH/detail

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Yet another example of a Mexican doing a job Americans are
   unwilling to do.   -- Reno Sepulveda, on UniVision reporters asking
                         President Obama some pointed questions about
                         the BATFE Fast and Furious scandal.
-----------------------------------------------------------------------
  5 days until SWMBO's Birthday

Re: New rule for HTML spam, using comments?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-14 01:05:
> Lately, I've been getting hit with a LOT of this type of spam:
>
> http://pastebin.com/HD0rNdxU
>
> Not all of it is identical in format, but there seems to be one thing
> in common: they include lots of random garbage inside either CSS or 
> in
> HTML comments.  All of this gets ignored by the HTML parser and
> doesn't display, but is nevertheless in the raw source.  The example
> above includes both types: non-parsing garbage in the CSS header, and
> an HTML comment at the end.

with "tidy -m msg" we could start validating the css and html of spam mails?

i have some rules that hit on invalid html tags


-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it

Re: New rule for HTML spam, using comments?

Posted by Wolfgang Zeikat <wo...@desy.de>.
In an older episode, on 2013-06-14 01:36, Amir 'CG' Caspi wrote:

> (I am relatively new to SA's internal workings and don't know how to 
> make such a rule, however.)

For basics of writing SA rules, maybe look at
http://wiki.apache.org/spamassassin/WritingRules

Hope this helps,

wolfgang



Re: New rule for HTML spam, using comments?

Posted by Kris Deugau <kd...@vianet.ca>.
Amir 'CG' Caspi wrote:
> At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
>> a.) You are copying/pasting the body of the email, but not the headers.
> 
> No, I am copying the headers... however, I am using Eudora (ancient, I
> know) as a mail client, and it's possible the headers are not properly
> formatted.  For example, for SpamCop I have to use their "workaround"
> script.  I don't know what exactly is mal-formed, though.
> 
> I should admit at this point that much of my sa-learn has been on
> Eudora's mboxes, by the way.  That is, I would take the Eudora mbox and
> sa-learn on that.  Eudora is supposed to use standard mbox format, but
> I'm wondering if maybe it's not so standard after all...

Try opening the on-disk file with Notepad (or your favourite text editor
on *nix).  If you see the same thing you see when you hit the "blah blah
blah" button in Eudora, you should be OK.  If not...

-kgd

Re: New rule for HTML spam, using comments?

Posted by Ben Johnson <be...@indietorrent.org>.

On 6/18/2013 1:18 PM, Amir 'CG' Caspi wrote:
> At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
>> a.) You are copying/pasting the body of the email, but not the headers.
> 
> No, I am copying the headers... however, I am using Eudora (ancient, I
> know) as a mail client, and it's possible the headers are not properly
> formatted.  For example, for SpamCop I have to use their "workaround"
> script.  I don't know what exactly is mal-formed, though.

For the sake of troubleshooting, can you try accessing the mail by some
other means, e.g., opening the file directly from the filesystem?
Doesn't mbox store email messages as plaintext files? (Kris already beat
me to it regarding this suggestion.)

> I should admit at this point that much of my sa-learn has been on
> Eudora's mboxes, by the way.  That is, I would take the Eudora mbox and
> sa-learn on that.  Eudora is supposed to use standard mbox format, but
> I'm wondering if maybe it's not so standard after all...

How would anything ever be flagged with a score higher than BAYES_00 if
this were to be the problem? Didn't you report a score of BAYES_99 in
one of your tests?

> Either way, I am _trying_ to copy the entire message.  Not sure what is
> misformatted there.  If you take a look at my two pasted examples (links
> below for convenience), those are direct copy/paste from Eudora's "raw
> source" view.  Any idea what is malformed?  Do I need an extra newline
> between the header and body, or something more complicated?
> 
> http://pastebin.com/HD0rNdxU
> http://pastebin.com/Zswg77Ds

How are you feeding the messages to sa-learn? Are you not just passing
the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora
and paste into a temporary file when you can just point sa-learn
straight to the message on disk?
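
For example (paths made up), you can hand sa-learn a Maildir folder
directly, or an mbox file with the --mbox switch:

sa-learn --spam /var/vmail/example.com/user/Maildir/.Junk/cur
sa-learn --spam --mbox /path/to/spam.mbox

which avoids any chance of the copy/paste step mangling the headers.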

>> b.) You are running Bayes as two different users when you perform your
> 
> No, I have been careful for that.  You saw that I pasted the maillog
> entries... notice that spamd runs as setuid.  I made sure the same
> userid was in the logs, and in my command.

I had missed that detail; looks okay.

>> Have a look at the thread I cited and see if anything jumps-out at you.
> 
> Will do, but unfortunately, I don't think the problem is as clear cut as
> (b) ... maybe it's (a) though, in which case I wonder if I have to
> modify my Eudora mboxes before learning on them.

Do you retain your training corpus? This may be one of those instances
in which the best way to debug the problem is to wipe and retrain Bayes.
Of course, that can be a nightmare if you don't retain the messages that
you've trained as ham and spam.
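
For what it's worth, if you do go that route, the wipe-and-retrain sequence is
roughly the following -- the mbox paths are placeholders, and it needs to run
as the same user whose Bayes DB spamd is actually using:

sa-learn --clear                                  # wipe this user's Bayes DB
sa-learn --spam --mbox /path/to/spam-corpus.mbox  # relearn known spam
sa-learn --ham  --mbox /path/to/ham-corpus.mbox   # relearn known ham
sa-learn --sync                                   # fold the journal into the DB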

> Thanks.
> 
>                         -- Amir

Re: New rule for HTML spam, using comments?

Posted by Amir Caspi <ce...@3phase.com>.
On Tue, June 18, 2013 4:36 pm, RW wrote:
> One thing to watch out for is that a mailbox may contain hidden deleted
> mail that remains there until the mail client compacts/expunges the
> mailbox. For that reason I prefer explicit training folders rather than
> folders where misclassified mails have been moved-out.

Indeed, that can certainly be an issue.  I make sure to create new folders
specifically for training (though I end up not keeping the messages
because they're just too cumbersome), and whenever I run on an existing
folder, I ensure that it is expunged/compacted beforehand.  It's
definitely a good warning, though, especially when trying to train on
folders with false hits moved out.

						--- Amir


Re: New rule for HTML spam, using comments?

Posted by RW <rw...@googlemail.com>.
On Tue, 18 Jun 2013 13:13:56 -0600 (MDT)
Amir Caspi wrote:

> Well, I'm not really concerned about getting any header-related SA
> rules to hit, for these tests.  As I mentioned previously, my primary
> concern right now is the disconnect between the Bayes score during
> the automatic MTA delivery and during a manual spamc processing.  I'm
> going to try training my database in a different way, using the
> on-server Spam mbox instead of the Eudora mbox, to see if I can get
> better results (e.g. if Eudora's mbox format is simply not correct).
> [The lack of envelope From is an artifact of copy/paste from
> Eudora... and in Eudora's mbox format, the envelope From is also
> stripped for some unknown reason. 

That's set on delivery into a spool file, but IIRC it's not transmitted
in POP or IMAP (IMAP has a concept of an envelope but it's not the same
thing). Some clients put an address there for the sake of form, but
it's a bit pointless.

One thing to watch out for is that a mailbox may contain hidden deleted
mail that remains there until the mail client compacts/expunges the
mailbox. For that reason I prefer explicit training folders rather than
folders where misclassified mails have been moved-out.

Re: New rule for HTML spam, using comments?

Posted by Amir Caspi <ce...@3phase.com>.
On Tue, June 18, 2013 1:01 pm, Martin Gregorie wrote:
> The main thing I notice is that there are only two Received: headers,
> and no envelope-From so IMO you're hoping for too much from the
> header-related SA rules simply because there's very little for SA to get
> its teeth into.

Well, I'm not really concerned about getting any header-related SA rules
to hit, for these tests.  As I mentioned previously, my primary concern
right now is the disconnect between the Bayes score during the automatic
MTA delivery and during a manual spamc processing.  I'm going to try
training my database in a different way, using the on-server Spam mbox
instead of the Eudora mbox, to see if I can get better results (e.g. if
Eudora's mbox format is simply not correct).  [The lack of envelope From
is an artifact of copy/paste from Eudora... and in Eudora's mbox format,
the envelope From is also stripped for some unknown reason.  I'm really
beginning to doubt Eudora's storage format for purposes of spam
identification, though maybe I'm just being paranoid and the real cause is
something else.]

I'll probably add a .pw ban as well, but that's a separate issue.  And,
the _original_ subject of this email was about a new rule for HTML comment
gibberish, which I would still love, but which is also unrelated to
headers.

Thanks. =)

						--- Amir


Re: New rule for HTML spam, using comments?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Tue, 2013-06-18 at 20:01 +0100, Martin Gregorie wrote:
> BTW, I just ran through 848 messages on this fairly average host (Lenovo
> R61i, Intel Core Duo at 1.6GHz, 3GB RAM) running Fedora 18. The first
> run averaged 1095 ms/message and the second averaged 96 ms/message, so I
> don't think John's STYLE_GIBBERISH rule is doing any harm.
> 
96 ms/message should read 696 ms/message.


Martin




Re: New rule for HTML spam, using comments?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Tue, 2013-06-18 at 11:18 -0600, Amir 'CG' Caspi wrote:
> At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
> >a.) You are copying/pasting the body of the email, but not the headers.
> 
> No, I am copying the headers... however, I am using Eudora (ancient, 
> I know) as a mail client, and it's possible the headers are not 
> properly formatted.  For example, for SpamCop I have to use their 
> "workaround" script.  I don't know what exactly is mal-formed, though.
> 
Your headers look OK to me visually. After adding .pw to my banned
countries list I ran it through my SA copy and got three URIBL hits and
two of my rules (which work on headers) got hits too.

The main thing I notice is that there are only two Received: headers,
and no envelope-From so IMO you're hoping for too much from the
header-related SA rules simply because there's very little for SA to get
its teeth into.


BTW, I just ran through 848 messages on this fairly average host (Lenovo
R61i, Intel Core Duo at 1.6GHz, 3GB RAM) running Fedora 18. The first
run averaged 1095 ms/message and the second averaged 96 ms/message, so I
don't think John's STYLE_GIBBERISH rule is doing any harm.

Part of the speed-up between runs will be due to buffer/RAM optimisation
but the script I used for the second run does fractionally less
processing on the spamc output and almost certainly a lot of the
difference is due to caching in my local DNS (on a separate local host).


Martin




Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
Replies to multiple folks below...

At 1:42 PM -0400 06/18/2013, Kris Deugau wrote:
>Try opening the on-disk file with Notepad (or your favourite text editor
>on *nix).  If you see the same thing you see when you hit the "blah blah
>blah" button in Eudora, you should be OK.  If not...

I've done that and I think I see the same thing.  Indeed, as I 
mentioned earlier, I use the on-disk file to pipe into sa-learn 
directly.  However, I'm not quite sure that I trust Eudora's on-disk 
file to actually be what an mbox format is supposed to be, at this 
point.  That is, I'm wondering if Eudora's on-disk storage is also 
somehow not correct, compared to what the incoming mail format is.

At 7:44 PM +0200 06/18/2013, Axb wrote:
>  a simple, fast & cheap URI rule would catch these
>
>and to make it even more efficient - reject anything with HELO or 
>sender using .pw and save lots of cycles.

Sure, but my problem right now is not the .pw spams specifically. 
It's the fact that I'm getting different results when the spam is 
first processed compared to when I run it through spamc manually.  In 
some cases I've done this literally within seconds of receiving the 
spam and STILL got different scores (see earlier email).

At 1:57 PM -0400 06/18/2013, Ben Johnson wrote:
>For the sake of troubleshooting, can you try accessing the mail by some
>other means, e.g., opening the file directly from the filesystem?

See reply to Kris above.  I think mbox is plaintext, yes... but 
Eudora strips attachments and places them in separate directories so 
they are not in the monolithic mbox.  That's ONE way Eudora is 
different than other clients... I'm wondering if there are yet more 
differences, which would explain why the message in the mbox is not 
identical to what originally was delivered by the MTA.

>How would anything ever be flagged with a score higher than BAYES_00 if
>this were to be the problem? Didn't you report a score of BAYES_99 in
>one of your tests?

The Bayes99 I reported earlier was from running it manually. 
However, yes, I do get high Bayes scores on auto-classified spam... 
I've been perusing my Spam folder (where the MTA dumps anything with 
X-Spam-Status: YES) and a number of the TP hits from SA do show 
Bayes99 (though they often also show lower scores).

Clearly, sa-learn can parse the Eudora mbox format... I'm just 
wondering if there's something about it that makes it sufficiently 
different from the raw mail delivered by the MTA that is confusing 
Bayes.

>How are you feeding the messages to sa-learn? Are you not just passing
>the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora
>and paste into a temporary file when you can just point sa-learn
>straight to the message on disk?

Eudora is run on my laptop; SA is run on the server (UNIX). 
Therefore, I can't point SA directly to the message on disk.  The 
server also uses mbox, not maildir, so I can't point to individual 
messages, only whole mailboxes.

I do the copy/paste when I want to run individual messages 
manually... not through sa-learn, but through spamc.  This is to 
check why some messages seem to get low Bayes scores when delivered 
by the MTA... in many cases I get much higher scores when I run it 
manually, which is making me question the way my DB is getting 
trained.  (For reference, both the MTA and my manual calls are 
running the message through spamc with the same user DB, so they 
_should_ return identical scores if the manually-fed message is 
identical... which is why I'm now starting to think it is NOT 
identical, and why I'm now questioning how the training is done.)

When I run the Eudora mbox through training, I literally just copy 
the mbox from my laptop to the server, then run:

sa-learn --no-sync --progress --spam --mbox Eudora_Junk

(The journal auto-syncs shortly afterwards.)

Actually, I do have one step in there: I have to change the CR 
line endings that Eudora uses into the LF (newline) line endings that 
UNIX uses... but that's the only change, and the result should be fully 
compliant with what's passing through the MTA.
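
(For the record, that conversion step is just a one-liner -- the file names
below are placeholders for whatever the Eudora mailbox is actually called:)

# Eudora (classic Mac style) uses bare CR line endings; turn them into LF
tr '\r' '\n' < Junk.mbx > Eudora_Junk
sa-learn --no-sync --progress --spam --mbox Eudora_Junk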


>Do you retain your training corpus? This may be one of those instances
>in which the best way to debug the problem is to wipe and retrain Bayes.
>Of course, that can be a nightmare if you don't retain the messages that
>you've trained as ham and spam.

I don't have the corpus because the SA installation that I have is 
through Parallels Pro Control Panel.  It initially ships completely 
untrained, and when I deployed it about 6 years ago, I didn't know 
much about SA nor the need for training.  The DB has been 
autolearning over the past 5-6 years on its own.  It is only within 
the last two months that I've been manually trying to teach it.

So, no, I don't have a corpus of spam and ham on which to train.  I 
do have a Spam mailbox with about 1000 messages (both TPs and FNs), 
and of course I have my inbox with about 3300 messages in it (and 
tens of thousands more in archive folders)... in principle, I could 
probably train on these mboxes.  However, if Eudora's mbox formatting 
is indeed the problem, it means I will need to change how I store 
things, like switching email clients (which I should probably do 
anyway given how ancient and unsupported Eudora is, but you know how 
hard it is to switch clients), or at the very least changing how the 
server is storing/delivering my mail.

I wonder if there's any better way to debug this.

Thanks for all the help so far.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Axb <ax...@gmail.com>.
On 06/18/2013 07:18 PM, Amir 'CG' Caspi wrote:
> Either way, I am _trying_ to copy the entire message.  Not sure what is
> misformatted there.  If you take a look at my two pasted examples (links
> below for convenience), those are direct copy/paste from Eudora's "raw
> source" view.  Any idea what is malformed?  Do I need an extra newline
> between the header and body, or something more complicated?
>
> http://pastebin.com/HD0rNdxU
> http://pastebin.com/Zswg77Ds

  a simple, fast & cheap URI rule would catch these

and to make it even more efficient - reject anything with HELO or sender 
using .pw and save lots of cycles.
Any FP will be reported and you can WL
(been rejecting this from the first day they showed up in spams - still 
waiting for a FP report.)
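
For the SA side, something along these lines would do -- purely illustrative,
the rule name and score are made up, and the HELO/sender reject happens at the
MTA, not here:

# local rule sketch: hit any URI whose hostname ends in the .pw TLD
uri      LOCAL_URI_PW  /^[a-z]+:\/\/[^\/]+\.pw(?:[:\/?#]|$)/i
describe LOCAL_URI_PW  URI hostname in the .pw TLD
score    LOCAL_URI_PW  3.0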




Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 8:58 AM -0400 06/18/2013, Ben Johnson wrote:
>a.) You are copying/pasting the body of the email, but not the headers.

No, I am copying the headers... however, I am using Eudora (ancient, 
I know) as a mail client, and it's possible the headers are not 
properly formatted.  For example, for SpamCop I have to use their 
"workaround" script.  I don't know what exactly is mal-formed, though.

I should admit at this point that much of my sa-learn has been on 
Eudora's mboxes, by the way.  That is, I would take the Eudora mbox 
and sa-learn on that.  Eudora is supposed to use standard mbox 
format, but I'm wondering if maybe it's not so standard after all...

Either way, I am _trying_ to copy the entire message.  Not sure what 
is misformatted there.  If you take a look at my two pasted examples 
(links below for convenience), those are direct copy/paste from 
Eudora's "raw source" view.  Any idea what is malformed?  Do I need 
an extra newline between the header and body, or something more 
complicated?

http://pastebin.com/HD0rNdxU
http://pastebin.com/Zswg77Ds

>b.) You are running Bayes as two different users when you perform your

No, I have been careful about that.  You saw that I pasted the maillog 
entries... notice that spamd runs as setuid.  I made sure the same 
userid was in the logs, and in my command.

>Have a look at the thread I cited and see if anything jumps-out at you.

Will do, but unfortunately, I don't think the problem is as clear cut 
as (b) ... maybe it's (a) though, in which case I wonder if I have to 
modify my Eudora mboxes before learning on them.

Thanks.

						-- Amir

Re: New rule for HTML spam, using comments?

Posted by Ben Johnson <be...@indietorrent.org>.

On 6/18/2013 5:31 AM, Amir 'CG' Caspi wrote:
> At 4:37 PM -0400 06/14/2013, Alex wrote:
>> On Fri, Jun 14, 2013 at 4:18 PM, Amir 'CG' Caspi <ce...@3phase.com>
>> wrote:
>>  > I wonder if there's some
>>  > difference between running spamassassin manually on the message versus
>>  > running spamd.
>>
>> I think the only difference would be if spamd somehow didn't recognize
>> all the locations for your rules.
> 
> OK, I've got some more weirdness here.  I just received two FN spams...
> one had bayes00, another bayes50.  To test what the heck might be going
> on, I ran both of the emails through spamc manually... this SHOULD
> recreate the same thing that occurs when sendmail delivers the email and
> spamc gets run automatically.
> 
> The first email, which was bayes00 originally, hit with bayes99 when I
> ran it manually through spamc.  There were only a few minutes between
> the first and second run (see timestamps below)... nothing very
> important happened to the Bayes DB between those two runs.  The second
> email, bayes50, stayed exactly the same (also bayes50).  I looked
> through the /var/log/maillog to see if I could figure out some
> difference between the two runs, but they look basically identical.
> 
> The only difference I can figure is that the second (manual) run happens
> on mail source that I copy/paste from my email program... that is, it's
> pure text, copied and pasted.  The first (automatic) run is on the mail
> as it enters the system, which might be somehow formatted differently. 
> All of my sa-learn training is done directly on my mbox files, which
> perhaps is more akin to copy/paste than anything else...
> 
> Anyone know what the hell is going on here?  For reference, here is the
> maillog entry for the bayes00 message when it went through automatically:
> 
> Jun 18 05:00:32 kismet sendmail[27721]: r5I90WGI027721:
> from=<Ju...@stetacusesse.us>, size=16502, class=0, nrcpts=1,
> msgid=<NN...@efeo6h8pf.stetacusesse.us>,
> proto=ESMTP, relay=root@localhost
> Jun 18 05:00:32 kismet sendmail[27707]: r5I90U4N027657:
> to=<us...@domain.com>, delay=00:00:01, xdelay=00:00:00,
> mailer=virthostmail, pri=136089, relay=domain.com, dsn=2.0.0, stat=Sent
> (r5I90WGI027721 Message accepted for delivery)
> Jun 18 05:00:32 kismet spamd[27586]: spamd: connection from
> localhost.localdomain [127.0.0.1] at port 53424
> Jun 18 05:00:32 kismet spamd[27586]: spamd: setuid to user@domain.com
> succeeded
> Jun 18 05:00:32 kismet spamd[27586]: spamd: processing message
> <NN...@efeo6h8pf.stetacusesse.us> for
> user@domain.com:22001
> Jun 18 05:00:33 kismet spamd[27586]: spf: lookup failed: Can't locate
> object method "new_from_string" via package "Mail::SPF::v1::Record" at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
> Jun 18 05:00:37 kismet spamd[27586]: pyzor: [27730] error: TERMINATED,
> signal 15 (000f)
> Jun 18 05:00:37 kismet spamd[27586]: spamd: clean message (-1.1/5.0) for
> user@domain.com:22001 in 5.0 seconds, 16781 bytes.
> Jun 18 05:00:37 kismet spamd[27586]: spamd: result: . -1 -
> BAYES_00,HTML_EXTRA_CLOSE,HTML_IMAGE_RATIO_08,HTML_MESSAGE,RDNS_NONE
> scantime=5.0,size=16781,user=user@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53424,mid=<NN...@efeo6h8pf.stetacusesse.us>,
> bayes=0.000000,autolearn=no
> 
> 
> And here is when it went through manually:
> 
> Jun 18 05:05:45 kismet spamd[27984]: spamd: connection from
> localhost.localdomain [127.0.0.1] at port 53447
> Jun 18 05:05:45 kismet spamd[27984]: spamd: setuid to user@domain.com
> succeeded
> Jun 18 05:05:45 kismet spamd[27984]: spamd: processing message
> <NN...@efeo6h8pf.stetacusesse.us> for
> user@domain.com:22001
> Jun 18 05:05:45 kismet spamd[27984]: spf: lookup failed: Can't locate
> object method "new_from_string" via package "Mail::SPF::v1::Record" at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
> Jun 18 05:05:47 kismet spamd[27984]: spamd: identified spam (6.0/5.0)
> for user@domain.com:22001 in 2.2 seconds, 16223 bytes.
> Jun 18 05:05:47 kismet spamd[27984]: spamd: result: Y 6 -
> BAYES_99,MISSING_MIME_HB_SEP,RDNS_NONE,T_MIME_NO_TEXT,URIBL_BLACK
> scantime=2.2,size=16223,user=user@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53447,mid=<NN...@efeo6h8pf.stetacusesse.us>,bayes=1.000000,autolearn=no
> 
> 
> 
> So... what the heck is going on?  I see basically no difference between
> the two maillog entries.  The only difference between the two runs, as
> far as I can tell, is that pyzor died on the first one (and I don't know
> why, but that shouldn't have ANY effect on the Bayes score), and the
> manual run was using the copy/paste from my mail program.
> 
> But, as mentioned, the bayes50 spam looked identical for both the
> automatic and manual runs.
> 
> Anyone have any idea what the heck is going on, and how I can fix it?
> 
> Is my Bayes DB worthless because I've been training it on MBOX format
> (i.e. ASCII), but when it runs the first time around, it's running on
> binary (MIME) instead?  If so, how can I fix this -- do I need to store
> my mail in some different format instead of MBOX?  (Except that sendmail
> delivers my mail in MBOX format...)
> 
> Thanks.
> 
>                         --- Amir

While my setup is slightly different (I use AMaViS), I had a similar
problem (discrepancies in Bayes scores for the same message) and with
the help of this list, we went through the entire setup -- rather
exhaustively. Here is that thread:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/201301.mbox/%3C50EDEBAD.2030104@indietorrent.org%3E
.

Basically, it sounds as though:

a.) You are copying/pasting the body of the email, but not the headers.
I made the same mistake. I use Thunderbird, and to view the actual
message source there, one presses Ctrl+U. *That's* the text you would
want to copy and paste.

b.) You are running Bayes as two different users when you perform your
tests. It's possible that SpamAssassin has its own user for executing
Bayes-related tasks, but you're using your own system account, for
example, which would explain the observed behavior. (By default, each
user has his own Bayes DB; it is possible to "hard-code" the Bayes user,
which is exactly what I had to do, for more reasons than one.)

I sincerely doubt that this is a problem with your mailbox format.

Have a look at the thread I cited and see if anything jumps-out at you.

-Ben

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 4:37 PM -0400 06/14/2013, Alex wrote:
>On Fri, Jun 14, 2013 at 4:18 PM, Amir 'CG' Caspi <ce...@3phase.com> wrote:
>  > I wonder if there's some
>  > difference between running spamassassin manually on the message versus
>  > running spamd.
>
>I think the only difference would be if spamd somehow didn't recognize
>all the locations for your rules.

OK, I've got some more weirdness here.  I just received two FN 
spams... one had bayes00, another bayes50.  To test what the heck 
might be going on, I ran both of the emails through spamc manually... 
this SHOULD recreate the same thing that occurs when sendmail 
delivers the email and spamc gets run automatically.

The first email, which was bayes00 originally, hit with bayes99 when 
I ran it manually through spamc.  There were only a few minutes 
between the first and second run (see timestamps below)... nothing 
very important happened to the Bayes DB between those two runs.  The 
second email, bayes50, stayed exactly the same (also bayes50).  I 
looked through the /var/log/maillog to see if I could figure out some 
difference between the two runs, but they look basically identical.

The only difference I can figure is that the second (manual) run 
happens on mail source that I copy/paste from my email program... 
that is, it's pure text, copied and pasted.  The first (automatic) 
run is on the mail as it enters the system, which might be somehow 
formatted differently.  All of my sa-learn training is done directly 
on my mbox files, which perhaps is more akin to copy/paste than 
anything else...

Anyone know what the hell is going on here?  For reference, here is 
the maillog entry for the bayes00 message when it went through 
automatically:

Jun 18 05:00:32 kismet sendmail[27721]: r5I90WGI027721: 
from=<Ju...@stetacusesse.us>, size=16502, class=0, 
nrcpts=1, 
msgid=<NN...@efeo6h8pf.stetacusesse.us>, 
proto=ESMTP, relay=root@localhost
Jun 18 05:00:32 kismet sendmail[27707]: r5I90U4N027657: 
to=<us...@domain.com>, delay=00:00:01, xdelay=00:00:00, 
mailer=virthostmail, pri=136089, relay=domain.com, dsn=2.0.0, 
stat=Sent (r5I90WGI027721 Message accepted for delivery)
Jun 18 05:00:32 kismet spamd[27586]: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 53424
Jun 18 05:00:32 kismet spamd[27586]: spamd: setuid to user@domain.com succeeded
Jun 18 05:00:32 kismet spamd[27586]: spamd: processing message 
<NN...@efeo6h8pf.stetacusesse.us> for 
user@domain.com:22001
Jun 18 05:00:33 kismet spamd[27586]: spf: lookup failed: Can't locate 
object method "new_from_string" via package "Mail::SPF::v1::Record" 
at /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
Jun 18 05:00:37 kismet spamd[27586]: pyzor: [27730] error: 
TERMINATED, signal 15 (000f)
Jun 18 05:00:37 kismet spamd[27586]: spamd: clean message (-1.1/5.0) 
for user@domain.com:22001 in 5.0 seconds, 16781 bytes.
Jun 18 05:00:37 kismet spamd[27586]: spamd: result: . -1 - 
BAYES_00,HTML_EXTRA_CLOSE,HTML_IMAGE_RATIO_08,HTML_MESSAGE,RDNS_NONE 
scantime=5.0,size=16781,user=user@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53424,mid=<NN...@efeo6h8pf.stetacusesse.us>, 
bayes=0.000000,autolearn=no


And here is when it went through manually:

Jun 18 05:05:45 kismet spamd[27984]: spamd: connection from 
localhost.localdomain [127.0.0.1] at port 53447
Jun 18 05:05:45 kismet spamd[27984]: spamd: setuid to user@domain.com succeeded
Jun 18 05:05:45 kismet spamd[27984]: spamd: processing message 
<NN...@efeo6h8pf.stetacusesse.us> for 
user@domain.com:22001
Jun 18 05:05:45 kismet spamd[27984]: spf: lookup failed: Can't locate 
object method "new_from_string" via package "Mail::SPF::v1::Record" 
at /usr/lib/perl5/vendor_perl/5.8.8/Mail/SPF/Server.pm line 524.
Jun 18 05:05:47 kismet spamd[27984]: spamd: identified spam (6.0/5.0) 
for user@domain.com:22001 in 2.2 seconds, 16223 bytes.
Jun 18 05:05:47 kismet spamd[27984]: spamd: result: Y 6 - 
BAYES_99,MISSING_MIME_HB_SEP,RDNS_NONE,T_MIME_NO_TEXT,URIBL_BLACK 
scantime=2.2,size=16223,user=user@domain.com,uid=22001,required_score=5.0,rhost=localhost.localdomain,raddr=127.0.0.1,rport=53447,mid=<NN...@efeo6h8pf.stetacusesse.us>,bayes=1.000000,autolearn=no


So... what the heck is going on?  I see basically no difference 
between the two maillog entries.  The only difference between the two 
runs, as far as I can tell, is that pyzor died on the first one (and 
I don't know why, but that shouldn't have ANY effect on the Bayes 
score), and the manual run was using the copy/paste from my mail 
program.

But, as mentioned, the bayes50 spam looked identical for both the 
automatic and manual runs.

Anyone have any idea what the heck is going on, and how I can fix it?

Is my Bayes DB worthless because I've been training it on MBOX format 
(i.e. ASCII), but when it runs the first time around, it's running on 
binary (MIME) instead?  If so, how can I fix this -- do I need to 
store my mail in some different format instead of MBOX?  (Except that 
sendmail delivers my mail in MBOX format...)

Thanks.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Fri, 2013-06-14 at 16:37 -0400, Alex wrote:

> > The rules definitely exist on my system.  I wonder if there's some
> > difference between running spamassassin manually on the message versus
> > running spamd.  The message I pasted was run through spamc/spamd.  Is there
> > something that I've misconfigured that might cause spamd to run differently
> > and skip some tests, that spamassassin would manually pick up?
> 

In any case that's easy to fix. For starters, keep your test message
collection on another box and use it to develop and test rules before
putting them live. You run spamd on this box and replace the MTA with a
script that can pass test messages in using spamc. Its logic needs to be
something like this:

    # start spamd
    for m in $*
    do
       echo "====== starting $m ======"
       spamc < "$m" | grep '^X-Spam'
       echo "======  end of $m  ======"
    done
    # stop spamd

grep is there so you just see the headers added by SA. You should also
clean messages by removing headers that start with 'X-Spam' before you
add them to your collection - having old SA headers in the collection is
just confusing. 
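
If formail (from procmail) is available, the cleaning step can look like this
-- the header list is only a guess at what your SA installation adds, so
extend it as needed:

# strip prior SA markup (folded continuation lines included) before filing
# the message into the test collection
formail -I "X-Spam-Status:" -I "X-Spam-Level:" -I "X-Spam-Flag:" \
        -I "X-Spam-Checker-Version:" -I "X-Spam-Report:" \
        < message.txt > message.clean.txt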

At least, this is how my test system is organised, along with scripts for
starting, stopping and checking the status of spamd. I also have scripts
to run spamassassin in --lint -D mode to check for rule errors, to clean
SA headers out of newly collected messages and to export configuration
file sets to the live system and restart it so the new rules will take
effect.

HTH

Martin




Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 4:37 PM -0400 06/14/2013, Alex wrote:
>I think the only difference would be if spamd somehow didn't recognize
>all the locations for your rules. Perhaps create a rule that you know
>will hit with a very low score in each directory that contains rules.
>Maybe there's a way to run spamd in the foreground with debugging,
>like there is with amavisd.

So, I again ran the email through spamassassin manually and, after 
restarting spamd, I ran it through spamc/spamd.  In both cases, I got 
bayes99 hits, along with LONGWORDS and MIME_NO_TEXT.  There was a 
minor difference in scores and tests between spamassassin and 
spamc/spamd (the former got a hit on NO_DNS_FOR_FROM, while the 
latter got a hit on DKIM_ADSP_NXDOMAIN), but they were pretty 
equivalent for the most part.

So, at least right now, it seems I _should_ be getting the same (or 
similar) scores through both methods.  I still have no idea why spamd 
wouldn't have given bayes99 previously, unless it really was some 
sort of change in the rules and spamd needed a restart.  (If that's 
the case, I'll just add a cron job to reboot spamd daily.)

Though, on that note, why would spamassassin hit on NO_DNS_FOR_FROM 
but not DKIM_ADSP_NXDOMAIN, while spamc/spamd would hit on the second 
and not the first?  They are getting identical input files... 
literally (I'm piping the same file into both commands).

Thanks.
						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 11:43 PM +0100 06/14/2013, Martin Gregorie wrote:
>Are you sure? Take a look at how sa_update is getting run to make sure
>that it is doing what you expect.

Yes, I'm sure.  I looked at the update script (in my case, it's 
called update_spamassassin, due to the way Parallels Pro configures 
their services)... all it does is call sa_update.  It does nothing to 
restart spamd.  This is likely an artifact of Parallels Pro messing 
with things in their distributions.

I also checked the actual spamd process, and this confirmed it had 
not been restarted since January.  This is the 'ps aux' output before 
I restarted the process:

root      2286  0.0  1.1  44992 37632 ?        Ss   Jan08   5:49 
/usr/bin/spamd --max-conn-per-child=1 -d -c -m5 -H -r 
/var/run/spamd.pid

On the other hand, the files in /var/lib/spamassassin/ get updated 
nearly every night.  So, sa_update is running, but spamd is not 
restarting.

I've added a line into that script to restart spamd nightly.  It 
should start tonight.  I guess I should only restart it if sa_update 
actually returns 0... but there's not too much harm in just doing it 
nightly anyway.
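
(If I do make it conditional, a fragment like this hypothetical one would do
-- the restart command is whatever the init setup on this box actually uses:)

# restart spamd only when sa-update actually installed new rules
# (exit status 0 = updated, 1 = no update available, >1 = error)
if /usr/bin/sa-update; then
    /sbin/service spamd restart
fi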

Thanks.
						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Fri, 2013-06-14 at 15:47 -0600, Amir 'CG' Caspi wrote:

> The only thing I can _possibly_ think of is that sa-update is run 
> nightly, but spamd doesn't get rebooted nightly...
>
Are you sure? Take a look at how sa_update is getting run to make sure
that it is doing what you expect. 

sa_update is normally run by the 'sa_update' script, which lives
in /etc/cron.daily. This script restarts spamd if /usr/bin/sa-update
returns 0, which only happens if the rules were updated.


Martin




Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 4:37 PM -0400 06/14/2013, Alex wrote:
>Yeah, but not bayes20. That's bad for sure. You should start
>collecting now, or pull a few hundred from your recent quarantine and
>use those, along with people's mail folders.

Well, I got bayes99 when I ran spamassassin manually just now.  So, I 
really have no idea.  I never trained the DB with this particular 
email (the last training I did was the day before receiving this 
email), and clearly spamd got a much lower Bayes score, but 
spamassassin manually got a higher one.

I officially have no clue what the heck is going on.

I should reiterate that this SA is running in a system with Parallels 
Pro Control Panel (virtual site hosting).  spamd is configured to run 
as setuid user; each user has his/her own Bayes DB (as it should be, 
I think).  spamd should be loading my DB when it runs, the same way 
spamassassin is doing.  I have zero idea why I'm getting different 
results from spamd and spamassassin, nor do I have any idea how to 
check if it's properly loading my DB, etc.  I _do_ find that the 
bayes files in my (user) .spamassassin directory are being updated 
regularly, so I'm fairly confident spamd is using my personal DB, the 
same way that spamassassin manually should be...

>I think the only difference would be if spamd somehow didn't recognize
>all the locations for your rules. Perhaps create a rule that you know
>will hit with a very low score in each directory that contains rules.
>Maybe there's a way to run spamd in the foreground with debugging,
>like there is with amavisd.

I don't think there's any problem with the rule locations, but I have 
no idea.  Shouldn't spamd run with exactly the same setup as 
spamassassin?

The only thing I can _possibly_ think of is that sa-update is run 
nightly, but spamd doesn't get rebooted nightly... if some of the 
rules have changed since the last time spamd was started (in my case, 
in January -- I have a very stable server!), then maybe spamd won't 
pick them up, but running spamassassin manually certainly will.

Does this sound like a legit potential reason for the discrepancy? 
(Although, would the bayes rules have changed at all, or would that 
be based ONLY on my DB?  If the latter, I'm still stumped as to how 
spamd got bayes20 and spamassassin got bayes99 on the same email.)

I've restarted spamd right now and I guess we'll see how the next FN 
I get compares between spamd and spamassassin.

Thanks.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Fri, Jun 14, 2013 at 4:18 PM, Amir 'CG' Caspi <ce...@3phase.com> wrote:
> At 9:43 PM -0400 06/13/2013, Alex wrote:
>>
>> I'd say if you have any that are hitting bayes20 or lower, your
>> database is not working properly and you should probably start over.
>
> Not quite sure I want to do that... I don't really have a sufficient corpus
> of mail for good training.  It's working well in general, just missing these
> particular entries.  As I saw from your most recent message, you're also
> getting low Bayes scores on some similar examples... so it seems like these
> things are somewhat successful in confusing the Bayes analysis, at least on
> some DBs and with some emails (different emails confuse different DBs).

Yeah, but not bayes20. That's bad for sure. You should start
collecting now, or pull a few hundred from your recent quarantine and
use those, along with people's mail folders.

>> I thought you may have manually modified the body because this looks
>> unique:
>>
>>    <x-html><!x-stuff-for-pete base=
>>
>> Do your other FNs have this? If so, you could consider generating a
>> rule from it.
>
> Almost all of my HTML FNs have this.  However, almost all of my legitimate
> HTML email (TNs) also have this (regardless of source, i.e. whether it comes
> from a large company opt-in ad or whether it comes from a friend's direct
> email).  It would appear to be some sort of XHTML email standard.  Filtering
> on this would be disastrous, at least for the email I receive.

Good to know.

>> Search your installation and see if the two rules even exist on your
>> system.
>
> The rules definitely exist on my system.  I wonder if there's some
> difference between running spamassassin manually on the message versus
> running spamd.  The message I pasted was run through spamc/spamd.  Is there
> something that I've misconfigured that might cause spamd to run differently
> and skip some tests, that spamassassin would manually pick up?

I think the only difference would be if spamd somehow didn't recognize
all the locations for your rules. Perhaps create a rule that you know
will hit with a very low score in each directory that contains rules.
Maybe there's a way to run spamd in the foreground with debugging,
like there is with amavisd.

Regards,
Alex

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 9:43 PM -0400 06/13/2013, Alex wrote:
>I'd say if you have any that are hitting bayes20 or lower, your
>database is not working properly and you should probably start over.

Not quite sure I want to do that... I don't really have a sufficient 
corpus of mail for good training.  It's working well in general, just 
missing these particular entries.  As I saw from your most recent 
message, you're also getting low Bayes scores on some similar 
examples... so it seems like these things are somewhat successful in 
confusing the Bayes analysis, at least on some DBs and with some 
emails (different emails confuse different DBs).

>I thought you may have manually modified the body because this looks unique:
>
>    <x-html><!x-stuff-for-pete base=
>
>Do your other FNs have this? If so, you could consider generating a
>rule from it.

Almost all of my HTML FNs have this.  However, almost all of my 
legitimate HTML email (TNs) also have this (regardless of source, 
i.e. whether it comes from a large company opt-in ad or whether it 
comes from a friend's direct email).  It would appear to be some sort 
of XHTML email standard.  Filtering on this would be disastrous, at 
least for the email I receive.

>Search your installation and see if the two rules even exist on your system.

The rules definitely exist on my system.  I wonder if there's some 
difference between running spamassassin manually on the message 
versus running spamd.  The message I pasted was run through 
spamc/spamd.  Is there something that I've misconfigured that might 
cause spamd to run differently and skip some tests, that spamassassin 
would manually pick up?

Thanks.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

>> After looking at it more closely, it's also only hitting bayes20 for
>> you. Do the others also score so low? This hits bayes99 on my system.
>
> The ones that SA doesn't catch, yes, they are typically low.  I have some
> that are bayes50, some bayes20, some bayes00.  Any that are bayes99 are
> almost certainly in my spam folder and I'm typically not looking at them (I
> don't have that much time to look at spam, so I prefer to look at FN rather
> than TP).

I'd say if you have any that are hitting bayes20 or lower, your
database is not working properly and you should probably start over.

It seems that populating it further with ham and FNs won't eliminate
the incorrect classification of the spam that's there already.

I thought you may have manually modified the body because this looks unique:

   <x-html><!x-stuff-for-pete base=

Do your other FNs have this? If so, you could consider generating a
rule from it.

Try running it as "spamassassin -t -D < sample > /tmp/sample.out 2>&1"

Then go through /tmp/sample.out. You should see it processing the
config files. Make sure it's including all the rules from your
installation.
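
For example, just grep the debug output -- the config debug lines all carry
the 'dbg: config:' prefix:

# list the debug lines showing which configuration files were read
grep 'dbg: config:' /tmp/sample.out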

You might also create a header or body rule to tag something unique:

header  MY_SUBJ_RULE Subject =~ /Mobility Solutions from Hoveround/i
score     MY_SUBJ_RULE 1.0

rawbody  MY_BODY_RULE /Enjoy Life Again with Mobility Solutions/i
score       MY_BODY_RULE 1.0

>> It also hits the "LONGWORDS" rule and "MIME_NO_TEXT", pushing it over
>> to be spam. Have you otherwise modified the body?
...
> I'm not sure why those rules are hitting for you and not for me.  I wonder
> if something is misconfigured on my installation.  I should disclose that my
> installation is on a Parallels Pro Control Panel machine... PPCP ships with
> an SA rpm, but I've updated it with the version from RPMforge
> (spamassassin-3.3.1-3.el5.rf, which is the latest one on that repo).
> sa-update is run nightly via cron.

Search your installation and see if the two rules even exist on your system.

# pwd
/var/lib/spamassassin/3.003002/updates_spamassassin_org
# grep LONGWORDS 20_body_tests.cf
...
meta LONGWORDS         (__LONGWORDS_A + __LONGWORDS_B + __LONGWORDS_C > 1)

When running it in debug mode, you should see something like this:

Jun 13 21:36:08.949 [1771] dbg: rules: ran body rule __LONGWORDS_B
======> got hit: "committees udometer forgets operated defoliated
between choose indeed micromanagement "

Regards,
Alex

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 8:04 PM -0400 06/13/2013, Alex wrote:
>After looking at it more closely, it's also only hitting bayes20 for
>you. Do the others also score so low? This hits bayes99 on my system.

The ones that SA doesn't catch, yes, they are typically low.  I have 
some that are bayes50, some bayes20, some bayes00.  Any that are 
bayes99 are almost certainly in my spam folder and I'm typically not 
looking at them (I don't have that much time to look at spam, so I 
prefer to look at FN rather than TP).

It's quite possible my Bayes DB is simply not sufficiently populated. 
I've trained it on about 750 FN spams over the last few weeks, but 
otherwise it's been mostly autolearning over the last 5 years or so. 
(As I said, I only recently started to get into the guts of SA, due 
to the increasing spam problem.)  On the upside, I get zero FPs even 
with a spam threshold of 5, but I'm clearly getting a lot of FNs.

>It also hits the "LONGWORDS" rule and "MIME_NO_TEXT", pushing it over
>to be spam. Have you otherwise modified the body?

The only thing I did to the body was to change potentially unique 
identifying strings in the URIs (just in case the spammer can look 
that pastebin up and track those strings to my email address and DoS 
or super-spam me... I am paranoid, I know, but figured it was easy 
enough to change the URI).  Those were just replaced with XXXXXXX 
appropriately.  The body is otherwise completely unmodified, and only 
the headers were slightly modified (again just to change host/email 
and other potentially unique identifiers).

I'm not sure why those rules are hitting for you and not for me.  I 
wonder if something is misconfigured on my installation.  I should 
disclose that my installation is on a Parallels Pro Control Panel 
machine... PPCP ships with an SA rpm, but I've updated it with the 
version from RPMforge (spamassassin-3.3.1-3.el5.rf, which is the 
latest one on that repo).  sa-update is run nightly via cron.

Any way to figure out why your rules are popping and mine aren't?

>The domain is also now listed in at least three RBLs.

By now, I expect so... I reported this spam to SpamCop (as I have 
been doing with all FN spam in the last month or so).

Thanks.

						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

On Thu, Jun 13, 2013 at 7:36 PM, Amir 'CG' Caspi <ce...@3phase.com> wrote:
> At 7:25 PM -0400 06/13/2013, Alex wrote:
>>
>> I think people will start by telling you to block the pw domain
>
> Sure, but not all of the comment-laden spam is from the pw domain. It comes
> in from .net, .com, .us, and a bunch of other places as well.  This is just
> the one example I happened to pick "randomly." I'm happy to post some others
> (non-pw) if necessary.

Yes, just meant that you should start with doing that, at the least.
After looking at it more closely, it's also only hitting bayes20 for
you. Do the others also score so low? This hits bayes99 on my system.

It also hits the "LONGWORDS" rule and "MIME_NO_TEXT", pushing it over
to be spam. Have you otherwise modified the body?

The domain is also now listed in at least three RBLs.

Regards,
Alex

Re: New rule for HTML spam, using comments?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 7:25 PM -0400 06/13/2013, Alex wrote:
>I think people will start by telling you to block the pw domain

Sure, but not all of the comment-laden spam is from the pw domain. 
It comes in from .net, .com, .us, and a bunch of other places as 
well.  This is just the one example I happened to pick "randomly." 
I'm happy to post some others (non-pw) if necessary.

Blocking the pw domain is certainly a thought, and thanks for the 
link... but there's plenty of comment-laden spam from non-pw domains, 
too.  Hence, why a rule might be a good idea, if it can be 
implemented.  (I am relatively new to SA's internal workings and 
don't know how to make such a rule, however.)

Thanks.
						--- Amir

Re: New rule for HTML spam, using comments?

Posted by Alex <my...@gmail.com>.
Hi,

> Lately, I've been getting hit with a LOT of this type of spam:
>
> http://pastebin.com/HD0rNdxU

I think people will start by telling you to block the pw domain

   From: Hoveround <ma...@xanti.shahphiler.pw>

More in this thread:

http://spamassassin.1065346.n5.nabble.com/pw-Palau-URL-domains-in-spam-td104383.html

Regards,
Alex