Posted to users@spamassassin.apache.org by Amir 'CG' Caspi <ce...@3phase.com> on 2013/06/30 20:42:53 UTC

LONGWORDS not hitting?

Hi all,

	Just got this spam:

http://pastebin.com/KM5paaZ9

To me, it looks like LONGWORDS should have hit... but it didn't.  I 
ran it manually through spamassassin and spamc, and LONGWORDS still 
didn't hit, so it seems to just not be hitting that rule.  But, to my 
eye, it looks like it should.  Any idea why it failed, and should 
LONGWORDS be updated?

(And yes, I know it only hit BAYES_50... I really think these 
gibberish strings are confusing Bayes.  This is also another example 
of where an HTML COMMENT GIBBERISH rule would help. ;-) )

Cheers!

						--- Amir

Re: LONGWORDS not hitting?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Sun, 2013-06-30 at 20:44 +0100, RW wrote:
> On Sun, 30 Jun 2013 12:42:53 -0600
> Amir 'CG' Caspi wrote:
> 
> > Hi all,
> > 
> > 	Just got this spam:
> > 
> > http://pastebin.com/KM5paaZ9
> > 
> 
> > (And yes, I know it only hit BAYES_50... I really think these 
> > gibberish strings are confusing Bayes.  
> 
> I don't think Bayes tokenizes html. When I displayed it in claws mail
> (with the dillo plugin) I just saw 4 links. Bayes is just seeing the
> displayed texts from those links and some tokens from the URIs.
> 
Yes. All the textual garbage is in two HTML comments, i.e. between
"<!--" and "-->", so it's quite possible that SA's HTML converter would
skip it because the recipient wouldn't see it.

However, the HTML is malformed: there are two <body> tags and only one
</body> in the message, so maybe that's why the HTML_TAG_BALANCE_BODY
rule fired?


Martin




Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 23:01:10 +0200
Benny Pedersen wrote:

> RW skrev den 2013-06-30 21:44:
> 
> > I don't think Bayes tokenizes html. When I displayed it in claws
> > mail (with the dillo plugin) I just saw 4 links. Bayes is just
> > seeing the displayed texts from those links and some tokens from
> > the URIs.
> 
> bayes digests it all; it's just that body rules only see the html part
> with the html markup stripped. rawbody is needed to make rules hit on
> invalid html tags; in a body rule the tags are removed before checking
> 
> it does not matter what poison is in the spam mails as long as one
> learns it as spam
> 
> i am fairly sure bayes digests whole msgs,



The sources of the body tokens are:

  $msgdata->{bayes_token_body} = $msg->{msg}->get_visible_rendered_body_text_array();

  $msgdata->{bayes_token_inviz} = $msg->{msg}->get_invisible_rendered_body_text_array();

which suggests it's rendered. The debug is consistent with this:

$ spamassassin -D bayes < /tmp/spam.txt 2>&1 | grep "dbg: bayes: token"
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'H*Ad:U*user' => 0.999370857921017
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Hx-languages-length:146' => 0.999231281198003
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Wireless' => 0.00584052835290255
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'wireless' => 0.0152476277925936
Jun 30 23:59:12.357 [20054] dbg: bayes: token '6985' => 0.0156699029126214
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'solutions' => 0.0270166806452548
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'mobile' => 0.0442780827402737
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'preferences' => 0.048896998570629
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'truly' => 0.0564015902450925
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Internet' => 0.118115920775885
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'UD:tv' => 0.131053546374482

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
RW skrev den 2013-06-30 21:44:

> I don't think Bayes tokenizes html. When I displayed it in claws mail
> (with the dillo plugin) I just saw 4 links. Bayes is just seeing the
> displayed texts from those links and some tokens from the URIs.

bayes digests it all; it's just that body rules only see the html part 
with the html markup stripped. rawbody is needed to make rules hit on 
invalid html tags; in a body rule the tags are removed before checking

it does not matter what poison is in the spam mails as long as one learns 
it as spam

i am fairly sure bayes digests whole msgs. the 4 urls could be scored 
higher with

meta URIBL_BLACK (3) (3) (3) (3)

i.e. dynamically add 3 to the current score from the spamassassin corpus. 
imho this one was listed, it just scored too little
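In actual local.cf syntax that would look something like the sketch below 
(the four values correspond to SpamAssassin's four score sets; the numbers 
are illustrative only):

  # Raise URIBL_BLACK in all four score sets (example values only --
  # tune against your own ham/spam mix).
  score URIBL_BLACK 4.5 4.5 4.5 4.5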

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 12:42:53 -0600
Amir 'CG' Caspi wrote:

> Hi all,
> 
> 	Just got this spam:
> 
> http://pastebin.com/KM5paaZ9
> 

> (And yes, I know it only hit BAYES_50... I really think these 
> gibberish strings are confusing Bayes.  

I don't think Bayes tokenizes html. When I displayed it in claws mail
(with the dillo plugin) I just saw 4 links. Bayes is just seeing the
displayed texts from those links and some tokens from the URIs.


Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 21:35:38 -0600
Amir 'CG' Caspi wrote:


> At 12:01 AM +0100 07/01/2013, RW wrote:
> >The sources of the body tokens are:
> >
> >   $msgdata->{bayes_token_body} = 
> >$msg->{msg}->get_visible_rendered_body_text_array();
> >
> >   $msgdata->{bayes_token_inviz} = 
> >$msg->{msg}->get_invisible_rendered_body_text_array();
> >
> >which suggests it's rendered. The debug is consistent with this:
> 
> So are you saying Bayes won't see rawbody at all?  It just uses body? 
> Or does it have tokens from both body and rawbody?
> 
> Also, what is "invisible" rendered body text?  Would this include the
> comments?

AFAIK, "invisible" means things like very small fonts and text with poor
or no contrast.

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 3:24 PM +0200 07/01/2013, Benny Pedersen wrote:
>if the content the end user sees is mangled, then the end user can't relearn ham to be spam

Yes, they can, because SA sees the "mangled" email before the user 
does.  Therefore if SA misclassifies an email as ham, that exact same 
email is the one seen by the end-user and can be reclassified as spam 
via sa-learn.

>yep, the point is that if it mangles both ham and spam, then the digests 
>would end up in bayes_50 :(

Only the MailScanner token would be seen in both ham and spam.  There 
are hundreds or thousands of other tokens.

>there is no way around that, except don't use mailscanner, or patch 
>the mangling to be removed

As discussed last week, we need to use MailScanner for security and I 
prefer to keep the URL munging intact to disable web bugs.

>this part does not work for spamassassin

As mentioned, it's _only_ this part that "does not work," but it 
shouldn't be causing specific problems.  By the way, this is also not 
the issue with what I asked originally, which is: why didn't 
LONGWORDS hit on this email, even though it seemed like it should? 
That isn't caused by MailScanner.

BTW, I also mentioned last week that it should be pretty easy to 
write a plugin for SA to "unmangle" the MailScanner URLs, because the 
original URL is contained within the ALT attribute of the IMG tag. 
This could be done prior to the Bayes analysis (or written as part of 
the Bayes code).  I unfortunately don't know enough about the guts of 
SA to write such a plugin, at least not yet, but the algorithm itself 
should be relatively straightforward given how MailScanner does its 
URL mangling.

Cheers.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-07-01 05:35:
> At 11:23 PM +0200 06/30/2013, Benny Pedersen wrote:
>> does it continue if one msg is learned as spam? does it still hit 
>> bayes_50 after that?
> No, it has BAYES_99 if I learn the message.  That is, running SA on
> the SAME message will give BAYES_99 after it's learned.  It's not a
> ham problem.

super, that means it works

>> you should just stop going to the urls in the spam mails; one more 
>> point is that mailscanner mangles content, which here poisons the 
>> bayes digest
> I am _NOT_ going to the URLs in the spam mail.  I'm not sure what you
> mean by that suggestion.  I know MailScanner is munging the URLs, but
> that is only for web bugs (not for links).  Also, see below.

if the content the end user sees is mangled, then the end user can't 
relearn ham to be spam

yep, the point is that if it mangles both ham and spam, then the digests 
would end up in bayes_50 :(

there is no way around that, except don't use mailscanner, or patch the 
mangling to be removed

the mangling part of mailscanner should be in an imap proxy or pop3 proxy 
so the mailserver keeps the original content as sent. if spamassassin 
added a header to help relearn bayes, like dspam does, then mailscanner 
could keep mangling like it does now, since dspam knows which digest is 
the original and which is the changed digest per email

this part does not work for spamassassin

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 11:23 PM +0200 06/30/2013, Benny Pedersen wrote:
>does it continue if one msg is learned as spam? does it still hit 
>bayes_50 after that?

No, it has BAYES_99 if I learn the message.  That is, running SA on 
the SAME message will give BAYES_99 after it's learned.  It's not a 
ham problem.

>you should just stop going to the urls in the spam mails; one more 
>point is that mailscanner mangles content, which here poisons the bayes digest

I am _NOT_ going to the URLs in the spam mail.  I'm not sure what you 
mean by that suggestion.  I know MailScanner is munging the URLs, but 
that is only for web bugs (not for links).  Also, see below.

>verify that mails are sent first to spamassassin and that mailscanner 
>mangles LAST in the chain

The way my system is set up, there is no way to get SA to run before 
MailScanner.  MailScanner has to run first.  It's not possible to 
change this without a lot of reconfiguring, unfortunately, due to the 
way the system is set up.

>it's very important that spamassassin sees the original content unmangled

We had that discussion a few weeks ago -- since MailScanner munges 
both ham and spam, it has essentially no effect on the Bayes score.

At 12:01 AM +0100 07/01/2013, RW wrote:
>The sources of the body tokens are:
>
>   $msgdata->{bayes_token_body} = 
>$msg->{msg}->get_visible_rendered_body_text_array();
>
>   $msgdata->{bayes_token_inviz} = 
>$msg->{msg}->get_invisible_rendered_body_text_array();
>
>which suggests it's rendered. The debug is consistent with this:

So are you saying Bayes won't see rawbody at all?  It just uses body? 
Or does it have tokens from both body and rawbody?

Also, what is "invisible" rendered body text?  Would this include the comments?

Even if comments are invisible to the user, they should still end up 
inside the body tags.  Consider: on every web browser, when you "view 
source," you can see comments and similar things.  They are not 
"rendered" in the sense that they're not displayed, but they are 
certainly processed by the HTML engine.  Anything within an HTML tag 
is processed, which is why you can see comments when you view source. 
It's still in the "body" ... just invisible.

Because of this, I would hope that HTML comments would end up within 
the Bayes "body" tags even if they are invisible.  Is there any way 
to verify this?  Since the debug output shows tokens, I guess one 
could make a test email, put some markers inside comments, and see if 
those markers show up in the Bayes tokenization debug output...

Thanks.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-30 23:09:

> very well.  The actual spammy content is only 5% of the message 
> (maybe
> less) and therefore doesn't "weigh" much in the Bayes analysis.

it could very well be

> because it reduces the efficacy of learning these messages, per the
> description above.

does it continue if one msg is learned as spam? does it still hit 
bayes_50 after that?

if so, lower the autolearn ham threshold so it learns less ham 
automatically
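A minimal local.cf sketch of that suggestion (the option names are the 
standard auto-learn thresholds; the values are just an example, not a 
recommendation):

  # Auto-learn as ham only below 0.05 instead of the default 0.1,
  # and leave the spam auto-learn threshold at its default.
  bayes_auto_learn_threshold_nonspam 0.05
  bayes_auto_learn_threshold_spam    12.0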

> written a "spellcheck" plugin for SA to do this?  Seems like a recipe
> for FPs, unfortunately.

perl has aspell and ispell modules, it's just still missing a plugin

> regexp similar to what John Hardin made for STYLE_GIBBERISH should
> work for this, appropriately modified for comments rather than style
> tags.

you should just stop going to the urls in the spam mails; one more 
point is that mailscanner mangles content, which here poisons the bayes 
digest

verify that mails are sent first to spamassassin and that mailscanner 
mangles LAST in the chain

it's very important that spamassassin sees the original content unmangled

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 8:57 PM +0200 06/30/2013, Benny Pedersen wrote:
>well it might confuse bayes, yes, but it can't stop you from running 
>sa-learn --spam on it?

I've been running "sa-learn --spam" on these messages for a month 
straight.  Some get picked up, others don't.  I'm still getting a lot 
of BAYES_50 on these, and I'm almost positive it's because of these 
enormous gibberish comments.  95% of the message content is this 
gibberish, and because it's random, it doesn't get picked up by Bayes 
very well.  The actual spammy content is only 5% of the message 
(maybe less) and therefore doesn't "weigh" much in the Bayes analysis.

In other words, learning these messages has far smaller effect than 
one might think it would, and I'm pretty certain one of the reasons 
the spammers are including kilobytes of gibberish text is exactly 
because it reduces the efficacy of learning these messages, per the 
description above.

>maybe one could add language checking on how many words are spelled 
>incorrectly, compared to the msg size

How's it going to figure out what's spelled incorrectly, especially 
for people who might have messages not in English?  Has someone 
written a "spellcheck" plugin for SA to do this?  Seems like a recipe 
for FPs, unfortunately.

At 11:01 PM +0200 06/30/2013, Benny Pedersen wrote:
>it does not matter what poison is in the spam mails as long as one learns it as spam

Per above, I don't think this is correct.  If 95% of the poison is 
random and changes every time, the "important" part of the poison 
doesn't weigh much in the tokenization.  I run these messages through 
sa-learn every time, and it catches a few nearly-identical messages 
because of it, but the next day, or the next week, others that LOOK 
like they should have been caught will slip by.

I don't know if there is an algorithm update to Bayes that could help 
catch this, but adding an HTML_COMMENT_GIBBERISH rule with a fairly 
high score will at least help to offset the lack of Bayes hits.  One 
doesn't need to run it through lint or tidy or what-not... I think a 
regexp similar to what John Hardin made for STYLE_GIBBERISH should 
work for this, appropriately modified for comments rather than style 
tags.
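
Just to sketch the idea (this is untested, and the rule name, regexp, and 
score are all placeholders rather than a tuned rule -- it would need 
checking against a ham corpus before use):

  # Placeholder sketch: an HTML comment opener followed by a long run
  # of plain lowercase "words" on the same raw line.
  rawbody   HTML_COMMENT_GIBBERISH  /<!--\s*(?:[a-z]{4,20}\s+){15,}/i
  describe  HTML_COMMENT_GIBBERISH  Long run of random words inside an HTML comment
  score     HTML_COMMENT_GIBBERISH  1.0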

Thanks.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-30 20:42:

> (And yes, I know it only hit BAYES_50... I really think these
> gibberish strings are confusing Bayes.  This is also another example
> of where an HTML COMMENT GIBBERISH rule would help. ;-) )

well it might confuse bayes, yes, but it can't stop you from running 
sa-learn --spam on it?

maybe one could add language checking on how many words are spelled 
incorrectly, compared to the msg size

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 1:43 PM +0100 08/24/2013, RW wrote:
>LONGWORDS is a body rule, i.e. it runs on a normalized version of the

Gah, THAT'S why it wasn't working?  I feel like an idiot now. =P

						--- Amir

Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sat, 24 Aug 2013 00:23:17 -0600
Amir 'CG' Caspi wrote:

> Hi all,
> 
> 	Since it's been a couple of weeks with no reply, I thought I 
> might ask this again.  See below.
> 	Do I need to file a bug for SA?  Is this something obvious 
> that I'm missing?  Does the LONGWORDS rule need an update?

LONGWORDS is a body rule, i.e. it runs on a normalized version of the
rendered text. Neither Bayes nor LONGWORDS sees any of the words
you're looking at.

You could try writing a separate rawbody rule, but it would see all
of the HTML and not just the comments.
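
For instance, a rough rawbody sketch along those lines (untested; the 
name, pattern, and score are placeholders, and it would match long word 
runs anywhere in the raw HTML, comments or not):

  # Placeholder: ten or more consecutive plain lowercase words in the
  # raw (still-HTML, decoded) body text.
  rawbody   RAW_LONGWORDS  /(?:\b[a-z]{4,15}\s+){10,}/
  describe  RAW_LONGWORDS  Long run of plain words in the raw body
  score     RAW_LONGWORDS  0.5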

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
Hi all,

	Since it's been a couple of weeks with no reply, I thought I 
might ask this again.  See below.
	Do I need to file a bug for SA?  Is this something obvious 
that I'm missing?  Does the LONGWORDS rule need an update?

	It appears that LONGWORDS is failing to hit on the original 
(server-side, MBOX) email with all MIME components... but hits on the 
email once it has been interpreted as text by the MUA.  Something 
about the MIME encoding is confusing LONGWORDS, even though I can't 
see why with my naked eye.
	Pastebin examples of both (server-side and MUA) versions are below.

Thanks.

						--- Amir

At 2:10 PM -0600 08/10/2013, Amir 'CG' Caspi wrote:
>At 12:42 PM -0600 06/30/2013, Amir 'CG' Caspi wrote:
>>Hi all,
>>
>>	Just got this spam:
>>
>>http://pastebin.com/KM5paaZ9
>>
>>To me, it looks like LONGWORDS should have hit... but it didn't.  I 
>>ran it manually through spamassassin and spamc, and LONGWORDS still 
>>didn't hit, so it seems to just not be hitting that rule.  But, to 
>>my eye, it looks like it should.  Any idea why it failed, and 
>>should LONGWORDS be updated?
>
>OK, more info and potentially new problem.  I re-tested one of the 
>spams I posted yesterday:
>http://pastebin.com/VCtvzjzV
>
>When running this example through SA (either SA standalone, or 
>spamc/spamd) now, LONGWORDS hits, as follows:
>
>Aug 10 15:47:20.115 [21805] dbg: rules: ran body rule __LONGWORDS_C 
>======> got hit: "authenticate dearth deplorers hogmane 
>fraudulentness going pillowcases believing vagotomy mastoidectomies "
>Aug 10 15:46:20.613 [21757] dbg: rules: ran body rule __LONGWORDS_B 
>======> got
>hit: "family husbandry allowed walloper little length voluntaries 
>weothao sternw
>ard "
>
>... BUT... this pastebin example is the copy/paste of "view raw 
>source" from my MUA.  If I run SA on the original server-side email 
>(i.e. the email as stored in my IMAP mailbox), LONGWORDS does _NOT_ 
>hit.  That is, neither _C nor _B hit on the server-side version, 
>despite hitting on the MUA version.
>
>For your perusal, I've copied the output of SA when running on the 
>server-side version, i.e. with all MIME content fully intact... see 
>here:
>
>http://pastebin.com/keNi5BjN
>
>What the heck is going on?  Why would LONGWORDS hit on the MUA 
>version but not the server-side?  Since LONGWORDS is a rawbody rule, 
>not based on headers, it seems like it should pop on both versions. 
>I'm guessing that there's something about the MIME content that's 
>making LONGWORDS fail to hit on the server-side (MBX) email, but 
>allows it to hit on the MUA ("view raw source") email... but I just 
>don't understand why that would be.
>
>I've had LONGWORDS hit at the server-side (pre-MUA) level, though 
>not very often (only 4 out of 465 messages currently in my spam 
>box), so it _is_ running... but for whatever reason, LONGWORDS hits 
>much more often (i.e. as it should) with the MUA "raw source" 
>versions than it does with server-side (MBOX/MBX) versions, so this 
>is not an isolated occurrence.
>
>So WTF is going on?  Does anyone have ideas?  To my eyeballs, the 
>exact same text is contained in both versions and therefore should 
>hit LONGWORDS in either version, but only one version pops.
>
>I'm happy to paste more debug output if it might help someone debug the rule.
>
>Thanks in advance.
>
>						--- Amir



Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 12:42 PM -0600 06/30/2013, Amir 'CG' Caspi wrote:
>Hi all,
>
>	Just got this spam:
>
>http://pastebin.com/KM5paaZ9
>
>To me, it looks like LONGWORDS should have hit... but it didn't.  I 
>ran it manually through spamassassin and spamc, and LONGWORDS still 
>didn't hit, so it seems to just not be hitting that rule.  But, to 
>my eye, it looks like it should.  Any idea why it failed, and should 
>LONGWORDS be updated?

OK, more info and potentially new problem.  I re-tested one of the 
spams I posted yesterday:
http://pastebin.com/VCtvzjzV

When running this example through SA (either SA standalone, or 
spamc/spamd) now, LONGWORDS hits, as follows:

Aug 10 15:47:20.115 [21805] dbg: rules: ran body rule __LONGWORDS_C 
======> got hit: "authenticate dearth deplorers hogmane 
fraudulentness going pillowcases believing vagotomy mastoidectomies "
Aug 10 15:46:20.613 [21757] dbg: rules: ran body rule __LONGWORDS_B ======> got
hit: "family husbandry allowed walloper little length voluntaries 
weothao sternw
ard "

... BUT... this pastebin example is the copy/paste of "view raw 
source" from my MUA.  If I run SA on the original server-side email 
(i.e. the email as stored in my IMAP mailbox), LONGWORDS does _NOT_ 
hit.  That is, neither _C nor _B hit on the server-side version, 
despite hitting on the MUA version.

For your perusal, I've copied the output of SA when running on the 
server-side version, i.e. with all MIME content fully intact... see 
here:

http://pastebin.com/keNi5BjN

What the heck is going on?  Why would LONGWORDS hit on the MUA 
version but not the server-side?  Since LONGWORDS is a rawbody rule, 
not based on headers, it seems like it should pop on both versions. 
I'm guessing that there's something about the MIME content that's 
making LONGWORDS fail to hit on the server-side (MBX) email, but 
allows it to hit on the MUA ("view raw source") email... but I just 
don't understand why that would be.

I've had LONGWORDS hit at the server-side (pre-MUA) level, though not 
very often (only 4 out of 465 messages currently in my spam box), so 
it _is_ running... but for whatever reason, LONGWORDS hits much more 
often (i.e. as it should) with the MUA "raw source" versions than it 
does with server-side (MBOX/MBX) versions, so this is not an isolated 
occurrence.

So WTF is going on?  Does anyone have ideas?  To my eyeballs, the 
exact same text is contained in both versions and therefore should 
hit LONGWORDS in either version, but only one version pops.

I'm happy to paste more debug output if it might help someone debug the rule.

Thanks in advance.

						--- Amir