Posted to users@spamassassin.apache.org by Happy Chap <sa...@happychap.plus.com> on 2010/08/04 10:23:32 UTC

Text contained in HTML comments causing BAYES_00 to classify as non-spam

Hi,

Over the last two months or so we have started getting lots of spam that
SpamAssassin isn't picking up. Analysing these, they all seem to be
of the same type, where many paragraphs of random text are "hidden" inside an
HTML comment (either contained in <!-- --> or in between /* and */ "tags").

Because of this "hidden" text, these messages are triggering BAYES_00 which,
I think, is the major influence on them not being correctly identified by
Spamassassin as spam.
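
To gauge how much of such a message is hidden, the rendered text and the
non-rendered text can be separated and compared. The following is a minimal
sketch using Python's standard html.parser module. The class and function
names are made up for illustration, and this is only an inspection aid, not a
description of how SpamAssassin parses HTML or what it feeds to Bayes.

```python
from html.parser import HTMLParser


class CommentSplitter(HTMLParser):
    """Collect rendered text and non-rendered text separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.hidden = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._in_style = True

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._in_style = False

    def handle_data(self, data):
        # Text inside <style>/<script>, including /* ... */ blocks, is never rendered.
        (self.hidden if self._in_style else self.visible).append(data)

    def handle_comment(self, data):
        # Contents of <!-- ... --> comments.
        self.hidden.append(data)


def split_text(html):
    """Return (visible_text, hidden_text) with whitespace normalised."""
    parser = CommentSplitter()
    parser.feed(html)
    parser.close()
    visible = " ".join(" ".join(parser.visible).split())
    hidden = " ".join(" ".join(parser.hidden).split())
    return visible, hidden
```

Calling split_text() on the HTML part of one of these messages gives two
strings whose lengths can be compared; a large hidden part next to a short
visible part matches the pattern described above.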

We're running a slightly old version of SpamAssassin (v3.2.3) on
SuSE 10.3, but we do run sa-update regularly to pick up new rules (which,
perhaps naively, I thought was more important than upgrading SpamAssassin
itself).

Has anyone got any advice on how to correctly identify these mails as spam?

Do I just need to upgrade to Spamassassin 3.3.0 (I'm assuming that this
probably won't make much difference because I'm already using the latest
rule sets thanks to sa-update)?

Any ideas/help would be very gratefully received, as the users are now
getting restless and Bayes training isn't really helping. Thanks.

David.
-- 
View this message in context: http://old.nabble.com/Text-contained-in-HTML-comments-causing-BAYES_00-to-classify-as-non-spam-tp29342874p29342874.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Bowie Bailey <Bo...@BUC.com>.
On 8/4/2010 6:07 PM, Happy Chap wrote:
>
> No, we're not using an SQL backend and every user has their own bayes
> database.

You mentioned previously that you are using 'sa-learn -u'.  I thought
that option only worked with SQL databases?

In my setup, I have lots of virtual users under the same UID with
per-user settings and bayes (using spamd's '-x' and
'--virtual-config-dir' options).  So when I run sa-learn, I explicitly
specify the database path to make sure it's learning to the right place
(sa-learn --dbpath /path/to/bayes ...).
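
A training script can pin the database in the same way. The sketch below
builds the sa-learn command line with an explicit --dbpath so that it cannot
silently fall back to the invoking user's database. The function names are
hypothetical, and the paths are placeholders to be replaced with real ones.

```python
import subprocess


def sa_learn_command(kind, dbpath, mail_path):
    """Build an sa-learn command that trains one mailbox/file into a given Bayes DB."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--dbpath", dbpath, mail_path]


def train(kind, dbpath, mail_path):
    """Run the command; raises CalledProcessError if sa-learn exits non-zero."""
    subprocess.run(sa_learn_command(kind, dbpath, mail_path), check=True)
```

For example, train("spam", "/path/to/bayes", "/path/to/missed-spam") would run
sa-learn --spam --dbpath /path/to/bayes /path/to/missed-spam.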

-- 
Bowie

Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.


Karsten Bräckelmann-2 wrote:
> 
> 
> So when you confirmed by running sa-learn --dump magic previously, did
> you first su to the user in question? The Bayes database does exist in
> the user's $HOME/.spamassassin/, right?
>  

Yes, I had su'ed to that user and yes, they have their own bayes_seen,
bayes_toks, etc. in $HOME/.spamassassin
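
The same check can be scripted across many accounts. This is a small sketch
that lists the bayes_* files in a user's .spamassassin directory; the function
name is made up, and it assumes the default file-based layout mentioned above
(bayes_seen, bayes_toks), not an SQL backend.

```python
from pathlib import Path


def bayes_files(home):
    """Return the sorted names of bayes_* files under <home>/.spamassassin."""
    sa_dir = Path(home) / ".spamassassin"
    # glob() on a missing directory simply yields nothing, so the result is [].
    return sorted(p.name for p in sa_dir.glob("bayes_*") if p.is_file())
```

An empty list for a user who receives mail would suggest that training and
scanning are not using the database in that home directory.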


Karsten Bräckelmann-2 wrote:
> 
> 
> Despite running per-user, site-wide Bayes DB still is possible IIRC, if
> you e.g. use an SQL backend.
> 
> 

No, we're not using an SQL backend and every user has their own bayes
database.


Karsten Bräckelmann-2 wrote:
> 
> 
> Anyway, since you still get BAYES_00 on these, you really should have a
> close look at the tokens Bayes considers most confident. And why. With
> some training, it most certainly at least should level up near BAYES_50,
> not stay at 00. The tokens should help tell you why.
> 
> 

OK, will do.

Thanks again for your help Karsten.

David.


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-08-04 at 14:39 -0700, Happy Chap wrote:
> Bowie Bailey wrote:

> > Stupid question here, but are you sure you are training the same
> > database that SA is using?
> > 
> > This is a fairly frequent problem.  Common cases are:
> > 
> > 1) SA being called as 'mailuser' and you are doing manual training on
> > root's database.
> > 2) You are manually training everything to the 'mailuser' database, but
> > SA is actually using per-user databases.
> 
> Good question Bowie. 
> 
> I don't think that's happening. We do have a generic system-wide procmailrc
> but its first command is for a DROPPRIVS, which I think/thought then runs
> as the specific user and in the procmail recipe a call is then made to spamc
> (although it is called without the -u option because, as I say, I think by
> issuing a DROPPRIVS it's running as that user so -u shouldn't be necessary).

*nod*

> If this doesn't sound right, by all means say - it's quite a while since I
> set all this up!
> 
> Training is definitely happening on a per user basis (ie. the script is
> calling sa-learn -u).

So when you confirmed by running sa-learn --dump magic previously, did
you first su to the user in question? The Bayes database does exist in
the user's $HOME/.spamassassin/, right?

Despite running per-user, site-wide Bayes DB still is possible IIRC, if
you e.g. use an SQL backend.


Anyway, since you still get BAYES_00 on these, you really should have a
close look at the tokens Bayes considers most confident. And why. With
some training, it most certainly at least should level up near BAYES_50,
not stay at 00. The tokens should help tell you why.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.


Bowie Bailey wrote:
> 
>  
> Stupid question here, but are you sure you are training the same
> database that SA is using?
> 
> This is a fairly frequent problem.  Common cases are:
> 
> 1) SA being called as 'mailuser' and you are doing manual training on
> root's database.
> 2) You are manually training everything to the 'mailuser' database, but
> SA is actually using per-user databases.
> 
> -- 
> Bowie
> 
> 

Good question Bowie. 

I don't think that's happening. We do have a generic system-wide procmailrc
but its first command is for a DROPPRIVS, which I think/thought then runs
as the specific user and in the procmail recipe a call is then made to spamc
(although it is called without the -u option because, as I say, I think by
issuing a DROPPRIVS it's running as that user so -u shouldn't be necessary).
If this doesn't sound right, by all means say - it's quite a while since I
set all this up!

Training is definitely happening on a per user basis (ie. the script is
calling sa-learn -u).

Thanks, David.





Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Bowie Bailey <Bo...@BUC.com>.
On 8/4/2010 4:24 PM, Happy Chap wrote:
> Bowie Bailey wrote:
>>  On 8/4/2010 4:23 AM, Happy Chap wrote:
>>
>> You ARE manually training bayes (sa-learn) on these missed spams,
>> right?  That is probably the most useful thing you can do if you are
>> getting Bayes_00 on them.
> Hi Bowie, oh yes, every night.

Stupid question here, but are you sure you are training the same
database that SA is using?

This is a fairly frequent problem.  Common cases are:

1) SA being called as 'mailuser' and you are doing manual training on
root's database.
2) You are manually training everything to the 'mailuser' database, but
SA is actually using per-user databases.

-- 
Bowie

Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.

Bowie Bailey wrote:
> 
>  On 8/4/2010 4:23 AM, Happy Chap wrote:
> 
> You ARE manually training bayes (sa-learn) on these missed spams,
> right?  That is probably the most useful thing you can do if you are
> getting Bayes_00 on them.
> 
> 

Hi Bowie, oh yes, every night.


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Bowie Bailey <Bo...@BUC.com>.
On 8/4/2010 4:23 AM, Happy Chap wrote:
> Hi,
>
> We started getting (over the last 2 months say) lots of spam, which
> Spamassassin isn't picking up as spam. Analysing these, they all seem to be
> of the same type where many paragraphs of random text are "hidden" inside an
> HTML comment (either contained in <!-- --> or in between /* and */ "tags").
>
> Because of this "hidden" text, these messages are triggering BAYES_00 which,
> I think, is the major influence on them not being correctly identified by
> Spamassassin as spam.
>
> We're running a slightly old version of Spamassassin (v3.2.3) running on
> SuSE 10.3 but do run sa-update's regularly to pick up new rules (which,
> perhaps naively, I thought was more important than upgrading Spamassassin
> itself).
>
> Has anyone got any advice on how to correctly identify these mails as spam?
>
> Do I just need to upgrade to Spamassassin 3.3.0 (I'm assuming that this
> probably won't make much difference because I'm already using the latest
> rule sets thanks to sa-update)?
>
> Any ideas/help would be very gratefully received as the users are now
> getting restless and, bayes training isn't really helping. Thanks.

You ARE manually training bayes (sa-learn) on these missed spams,
right?  That is probably the most useful thing you can do if you are
getting Bayes_00 on them.

-- 
Bowie

Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.

John Hardin wrote:
> 
> On Wed, 4 Aug 2010, Happy Chap wrote:
> 
> 
> Apart from BAYES_00 what rules are they hitting? 
> 
> 

Thanks for your reply John.

They're all more or less the same triggering:

BAYES_00
HTML_MESSAGE
MPART_ALT_DIFF
RDNS_NONE

and occasionally they also pick up one of the HTML_IMAGE_RATIO_xx triggers
too.


John Hardin wrote:
> 
> 
> Would they be classified 
> as spam if Bayes wasn't a factor?
> 
> 

No, probably not (HTML_MESSAGE, MPART_ALT_DIFF and RDNS_NONE aren't enough,
those that have HTML_IMAGE_RATIO_xx might just be enough but borderline).


John Hardin wrote:
> 
> 
> If that's the case, then train them as spam and you should be okay.
> 
> Regardless, 3.2.3 will be much less effective than 3.3.x and will only be 
> getting critical bugfixes (if anything) via sa-update. You should plan on 
> upgrading soon.
> 
> 

Well, I've heeded the advice and upgraded the server this evening, which has
reminded me why I try and avoid doing upgrades!! It's obviously my lack of
knowledge but it always seems so difficult to work out why things fail
(tonight I eventually tracked the install fail down to not having
openssl-devel installed, which meant not having rand.h, causing
Crypt::OpenSSL::Random to fail, etc., etc. which stopped Mail::SpamAssassin
installing). Got there eventually but, boy, is it hard work.

Anyway, I'll keep on training the bayes database and see if running SA 3.3.1
improves the situation.

Thanks everyone for their help.

David.







Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by John Hardin <jh...@impsec.org>.
On Wed, 4 Aug 2010, Happy Chap wrote:

> In that case (and I've been barking up the wrong tree) do you have any 
> suggestion as to what my next move should be to try to trap this type of 
> spam? I'm moderately technical, but I think I've probably reached the 
> limit of my current knowledge but am happy to learn if you could just 
> point me in the right direction.

Apart from BAYES_00 what rules are they hitting? Would they be classified 
as spam if Bayes wasn't a factor?

If that's the case, then train them as spam and you should be okay.

Regardless, 3.2.3 will be much less effective than 3.3.x and will only be 
getting critical bugfixes (if anything) via sa-update. You should plan on 
upgrading soon.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #7: In ten years nobody will remember the
   details of caliber, stance, or tactics. They will only remember who
   lived.
-----------------------------------------------------------------------
  Tomorrow: the 275th anniversary of John Peter Zenger's acquittal

Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.


Henrik K wrote:
> 
> On Wed, Aug 04, 2010 at 06:58:52AM -0700, Happy Chap wrote:
> 
> Do the tokens look such that they might be used in legitimate messages?
> Usually you just have to sa-learn --spam enough of such spams to get
> at least BAYES_50.
> 
> I have no idea what kind of spams they are, but it all depends on whether
> they have any tokens in common. But I can tell you that it's very rare to
> get BAYES_00 for spam if you just learn them properly.
> 
> 

Hi Henrik,

Certainly some of them look legitimate. Maybe we just haven't got enough
into sa-learn yet for it to have any effect. I don't know exactly, but
suppose the user's been getting 20 per day for say 8 weeks (these are both
guesses). So that's around 800 that should have been used to train bayes IF
the user had sent every one for training. I can see they have about 34k
identified spam mails in their bayes db, so these extra 800 would amount to
about 2%. Perhaps that's just not enough?
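
The estimate can be reproduced as a quick calculation. The sketch below
assumes the figure of 800 comes from 20 messages a day, five days a week, for
eight weeks (counting all seven days would give 1120 instead); the function
name is made up for illustration.

```python
def training_share(per_day, days_per_week, weeks, existing_spam):
    """Return (new spam count, new spam as a fraction of the existing spam count)."""
    new_spam = per_day * days_per_week * weeks
    return new_spam, new_spam / existing_spam


new_spam, share = training_share(20, 5, 8, 34_000)
print(new_spam, f"{share:.1%}")  # prints: 800 2.4%
```

So the "about 2%" above holds under that assumption, and only if every one of
those messages was actually sent for training.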

Thanks, David.


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Henrik K <he...@hege.li>.
On Wed, Aug 04, 2010 at 06:58:52AM -0700, Happy Chap wrote:
> 
> 
> 
> Henrik K wrote:
> > 
> > 
> > Instead of speculating, try:
> > 
> > cat msg | spamassassin -t -D bayes 2>&1 | grep bayes:
> > 
> > It will tell you exactly what tokens are considered.
> > 
> > 
> 
> Hi Henrik,
> 
> Thanks for your reply.
> 
> I'm not sure I totally understand all of the output to that, but I think
> that's telling me that it isn't taking the text in the comments into account
> - I can see various strings that it's picking up from the email, but the
> commented text isn't obviously there. Maybe that's what you were trying to
> tell me anyway :-)
> 
> In that case (and I've been barking up the wrong tree) do you have any
> suggestion as to what my next move should be to try to trap this type of
> spam? I'm moderately technical, but I think I've probably reached the limit
> of my current knowledge but am happy to learn if you could just point me in
> the right direction.

Do the tokens look like ones that might be used in legitimate messages?
Usually you just have to sa-learn --spam enough of such spams to get at least
BAYES_50.

I have no idea what kind of spams they are, but it all depends on whether
they have any tokens in common. But I can tell you that it's very rare to
get BAYES_00 for spam if you just learn them properly.
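
To see why shared tokens matter, consider a textbook per-token estimate: how
often a token was seen in learned spam compared with how often it was seen in
learned ham. This is a deliberately simplified sketch, not SpamAssassin's
actual Bayes implementation (which adjusts and combines token probabilities
differently). The hit counts in the example are hypothetical; only the
database sizes (about 34k spam, about 30k ham) come from this thread.

```python
def token_spam_probability(spam_hits, nspam, ham_hits, nham):
    """Simplified estimate of how 'spammy' a single token looks."""
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical token seen in 800 of 34,000 spams but also in 3,000 of 30,000 hams.
shared = token_spam_probability(800, 34_000, 3_000, 30_000)

# Hypothetical token seen in the same 800 spams and never in ham.
spam_only = token_spam_probability(800, 34_000, 0, 30_000)

print(round(shared, 3), spam_only)
```

Under this simplified model the shared token still leans towards ham even
after 800 spam sightings, while a token that never appears in ham points
firmly at spam. That is why the tokens these spams have in common, and
whether ham also contains them, matter more than the raw number of messages
learned.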


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.


Henrik K wrote:
> 
> 
> Instead of speculating, try:
> 
> cat msg | spamassassin -t -D bayes 2>&1 | grep bayes:
> 
> It will tell you exactly what tokens are considered.
> 
> 

Hi Henrik,

Thanks for your reply.

I'm not sure I totally understand all of the output to that, but I think
that's telling me that it isn't taking the text in the comments into account
- I can see various strings that it's picking up from the email, but the
commented text isn't obviously there. Maybe that's what you were trying to
tell me anyway :-)

In that case (and I've been barking up the wrong tree) do you have any
suggestion as to what my next move should be to try to trap this type of
spam? I'm moderately technical, but I think I've probably reached the limit
of my current knowledge but am happy to learn if you could just point me in
the right direction.

Thanks, David.





Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Henrik K <he...@hege.li>.
On Wed, Aug 04, 2010 at 01:23:32AM -0700, Happy Chap wrote:
> 
> Hi,
> 
> We started getting (over the last 2 months say) lots of spam, which
> Spamassassin isn't picking up as spam. Analysing these, they all seem to be
> of the same type where many paragraphs of random text are "hidden" inside an
> HTML comment (either contained in <!-- --> or in between /* and */ "tags").
> 
> Because of this "hidden" text, these messages are triggering BAYES_00 which,
> I think, is the major influence on them not being correctly identified by
> Spamassassin as spam.

Instead of speculating, try:

cat msg | spamassassin -t -D bayes 2>&1 | grep bayes:

It will tell you exactly what tokens are considered.
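
To run the same check over a whole folder of missed spam, the command can be
wrapped in a small script. This is a sketch only: the function names are made
up, it assumes spamassassin is on the PATH, and it uses the same options as
the command above.

```python
import subprocess


def bayes_debug_lines(debug_output):
    """Keep only the lines that mention the Bayes subsystem, like `grep bayes:`."""
    return [line for line in debug_output.splitlines() if "bayes:" in line]


def run_bayes_debug(message_path):
    """Run `spamassassin -t -D bayes` on one message file and return the Bayes lines."""
    with open(message_path, "rb") as msg:
        result = subprocess.run(
            ["spamassassin", "-t", "-D", "bayes"],
            stdin=msg,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # same effect as 2>&1
            check=False,
        )
    return bayes_debug_lines(result.stdout.decode("utf-8", errors="replace"))
```

Run it as the user whose Bayes database should be consulted, otherwise the
tokens reported may come from a different database.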


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> >It's unlikely that that could push the BAYES RESULT down to BAYES_00
> >unless there is uncorrected mistraining.

On 04.08.10 06:07, Happy Chap wrote:
> Possibly, but I suspect mistraining isn't a problem because apart from this
> specific type of spam, Spamassassin is doing (and has done for some time) a
> very good job of correctly identifying mail properly.

However, if you feed the mentioned spam to SA, it gets classified as ham,
which means SA is not doing a very good job on this kind of spam.
That is apparently caused by mistraining and can be solved by proper training
(apparently a lot of ham contains the same tokens).

> >I don't think the 3.2.x rules get updated much. Perhaps this is leading
> >to false autotraining in BAYES.

> Incidentally, I'm not sure the autotraining is much of a problem as it only
> seems to be very obvious (high scoring) spam (and ham) that triggers
> autotraining, according to the headers at least. Certainly none of this
> particular type of spam is getting autotrained according to the headers.

Luckily, you can re-train all misclassified spam and ham, and you are doing
that, aren't you?

> Finally, do you know if Spamassassin has rules that *should* catch this type
> of spam (ie. no legitimate email would include big blocks of random
> paragraphs inside HTML comments). I would have thought that of itself would
> have perhaps been picked up by a rule to identify it as spam.

The bayes_use_hapaxes option (on by default) could help here.
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.

Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by Happy Chap <sa...@happychap.plus.com>.
Hi RW, thanks for your reply.

>It's unlikely that that could push the BAYES RESULT down to BAYES_00
>unless there is uncorrected mistraining.

Possibly, but I suspect mistraining isn't a problem because, apart from this
specific type of spam, SpamAssassin is doing (and has done for some time) a
very good job of identifying mail correctly. If I do a dump of the
bayes database, we've got about 30k each of spam & ham that it's learned
from, and based on those numbers I don't think the percentage of mistrained
messages would be significant at all if the odd few were mistrained.

>I don't think the 3.2.x rules get updated much. Perhaps this is leading
>to false autotraining in BAYES.

Ah, perhaps this is more of a problem, I didn't realise there were different
rule updates based on the versions of Spamassassin (well, not between 3.2.x
and 3.3.x anyway). In that case, I'll try upgrading Spamassassin and see if
that helps.

Incidentally, I'm not sure the autotraining is much of a problem as it only
seems to be very obvious (high scoring) spam (and ham) that triggers
autotraining, according to the headers at least. Certainly none of this
particular type of spam is getting autotrained according to the headers.

Finally, do you know if Spamassassin has rules that *should* catch this type
of spam (ie. no legitimate email would include big blocks of random
paragraphs inside HTML comments). I would have thought that of itself would
have perhaps been picked up by a rule to identify it as spam.

Thanks again, David.


Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Posted by RW <rw...@googlemail.com>.
On Wed, 4 Aug 2010 01:23:32 -0700 (PDT)
Happy Chap <sa...@happychap.plus.com> wrote:

> 
> Hi,
> 
> We started getting (over the last 2 months say) lots of spam, which
> Spamassassin isn't picking up as spam. Analysing these, they all seem
> to be of the same type where many paragraphs of random text are
> "hidden" inside an HTML comment (either contained in <!-- --> or
> in between /* and */ "tags").
> 
> Because of this "hidden" text, these messages are triggering BAYES_00
> which, I think, is the major influence on them not being correctly
> identified by Spamassassin as spam.
 
It's unlikely that this alone could push the Bayes result down to BAYES_00
unless there is uncorrected mistraining.


> Do I just need to upgrade to Spamassassin 3.3.0 (I'm assuming that
> this probably won't make much difference because I'm already using
> the latest rule sets thanks to sa-update)?

I don't think the 3.2.x rules get updated much. Perhaps this is leading
to false autotraining in BAYES.