Posted to users@spamassassin.apache.org by Amir 'CG' Caspi <ce...@3phase.com> on 2013/06/30 20:42:53 UTC

LONGWORDS not hitting?

Hi all,

	Just got this spam:

http://pastebin.com/KM5paaZ9

To me, it looks like LONGWORDS should have hit... but it didn't.  I 
ran it manually through spamassassin and spamc, and LONGWORDS still 
didn't hit, so it seems to just not be hitting that rule.  But, to my 
eye, it looks like it should.  Any idea why it failed, and should 
LONGWORDS be updated?

(And yes, I know it only hit BAYES_50... I really think these 
gibberish strings are confusing Bayes.  This is also another example 
of where an HTML COMMENT GIBBERISH rule would help. ;-) )

Cheers!

						--- Amir

Re: LONGWORDS not hitting?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Sun, 2013-06-30 at 20:44 +0100, RW wrote:
> On Sun, 30 Jun 2013 12:42:53 -0600
> Amir 'CG' Caspi wrote:
> 
> > Hi all,
> > 
> > 	Just got this spam:
> > 
> > http://pastebin.com/KM5paaZ9
> > 
> 
> > (And yes, I know it only hit BAYES_50... I really think these 
> > gibberish strings are confusing Bayes.  
> 
> I don't think Bayes tokenizes html. When I displayed it in claws mail
> (with the dillo plugin) I just saw 4 links. Bayes is just seeing the
> displayed texts from those links and some tokens from the URIs.
> 
Yes. All the textual garbage is in two HTML comments, i.e. between
"<!--" and "-->", so it's quite possible that SA's HTML converter would
skip it because the recipient wouldn't see it.

However, the HTML is malformed: there are two <body> tags and only one
</body> in the message, so maybe that's why the HTML_TAG_BALANCE_BODY
rule fired?


Martin




Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 23:01:10 +0200
Benny Pedersen wrote:

> RW skrev den 2013-06-30 21:44:
> 
> > I don't think Bayes tokenizes html. When I displayed it in claws
> > mail (with the dillo plugin) I just saw 4 links. Bayes is just
> > seeing the displayed texts from those links and some tokens from
> > the URIs.
> 
> bayes digests it all; it's just that body rules only see the html part
> with the html markup stripped. rawbody is needed to make rules hit on
> invalid html tags; in a body rule the tags are removed before checking
> 
> it does not matter what poison is in the spam mails as long as one
> learns it as spam
> 
> i am fairly sure bayes digests whole msgs,



The sources of the body tokens are:

  $msgdata->{bayes_token_body} = $msg->{msg}->get_visible_rendered_body_text_array();

  $msgdata->{bayes_token_inviz} = $msg->{msg}->get_invisible_rendered_body_text_array();

which suggests it's rendered. The debug is consistent with this:

$ spamassassin -D bayes < /tmp/spam.txt 2>&1 | grep "dbg: bayes: token"
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'H*Ad:U*user' => 0.999370857921017
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Hx-languages-length:146' => 0.999231281198003
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Wireless' => 0.00584052835290255
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'wireless' => 0.0152476277925936
Jun 30 23:59:12.357 [20054] dbg: bayes: token '6985' => 0.0156699029126214
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'solutions' => 0.0270166806452548
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'mobile' => 0.0442780827402737
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'preferences' => 0.048896998570629
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'truly' => 0.0564015902450925
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'Internet' => 0.118115920775885
Jun 30 23:59:12.357 [20054] dbg: bayes: token 'UD:tv' => 0.131053546374482

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
RW skrev den 2013-06-30 21:44:

> I don't think Bayes tokenizes html. When I displayed it in claws mail
> (with the dillo plugin) I just saw 4 links. Bayes is just seeing the
> displayed texts from those links and some tokens from the URIs.

bayes digests it all; it's just that body rules only see the html part 
with the html markup stripped. rawbody is needed to make rules hit on 
invalid html tags; in a body rule the tags are removed before checking

it does not matter what poison is in the spam mails as long as one learns 
it as spam

i am fairly sure bayes digests whole msgs. the 4 urls could be scored 
higher with

meta URIBL_BLACK (3) (3) (3) (3)

i.e. dynamically add 3 to the current score from the spamassassin corpus. 
imho this one was listed, it just scored too little
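In actual local.cf syntax that would look something like the sketch below 
(the four values correspond to SpamAssassin's four score sets; the numbers 
are illustrative only):

  # Raise URIBL_BLACK in all four score sets (example values only --
  # tune against your own ham/spam mix).
  score URIBL_BLACK 4.5 4.5 4.5 4.5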

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 12:42:53 -0600
Amir 'CG' Caspi wrote:

> Hi all,
> 
> 	Just got this spam:
> 
> http://pastebin.com/KM5paaZ9
> 

> (And yes, I know it only hit BAYES_50... I really think these 
> gibberish strings are confusing Bayes.  

I don't think Bayes tokenizes html. When I displayed it in claws mail
(with the dillo plugin) I just saw 4 links. Bayes is just seeing the
displayed texts from those links and some tokens from the URIs.


Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sun, 30 Jun 2013 21:35:38 -0600
Amir 'CG' Caspi wrote:


> At 12:01 AM +0100 07/01/2013, RW wrote:
> >The sources of the body tokens are:
> >
> >   $msgdata->{bayes_token_body} = 
> >$msg->{msg}->get_visible_rendered_body_text_array();
> >
> >   $msgdata->{bayes_token_inviz} = 
> >$msg->{msg}->get_invisible_rendered_body_text_array();
> >
> >which suggests it's rendered. The debug is consistent with this:
> 
> So are you saying Bayes won't see rawbody at all?  It just uses body? 
> Or does it have tokens from both body and rawbody?
> 
> Also, what is "invisible" rendered body text?  Would this include the
> comments?

AFAIK, "invisible" means things like very small fonts and text with poor
or no contrast.

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 3:24 PM +0200 07/01/2013, Benny Pedersen wrote:
>if the content the end user sees is mangled, then the end user can't relearn ham to be spam

Yes, they can, because SA sees the "mangled" email before the user 
does.  Therefore if SA misclassifies an email as ham, that exact same 
email is the one seen by the end-user and can be reclassified as spam 
via sa-learn.

>yep, the point is that if it mangles both ham and spam, then the digests 
>would end up in bayes_50 :(

Only the MailScanner token would be seen in both ham and spam.  There 
are hundreds or thousands of other tokens.

>there is no way around that, except don't use mailscanner, or patch 
>the mangling to be removed

As discussed last week, we need to use MailScanner for security and I 
prefer to keep the URL munging intact to disable web bugs.

>this part does not work for spamassassin

As mentioned, it's _only_ this part that "does not work," but it 
shouldn't be causing specific problems.  By the way, this is also not 
the issue with what I asked originally, which is: why didn't 
LONGWORDS hit on this email, even though it seemed like it should? 
That isn't caused by MailScanner.

BTW, I also mentioned last week that it should be pretty easy to 
write a plugin for SA to "unmangle" the MailScanner URLs, because the 
original URL is contained within the ALT attribute of the IMG tag. 
This could be done prior to the Bayes analysis (or written as part of 
the Bayes code).  I unfortunately don't know enough about the guts of 
SA to write such a plugin, at least not yet, but the algorithm itself 
should be relatively straightforward given how MailScanner does its 
URL mangling.

Cheers.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-07-01 05:35:
> At 11:23 PM +0200 06/30/2013, Benny Pedersen wrote:
>> does it continue if one msg is learned as spam? does it still hit 
>> bayes_50 after that?
> No, it has BAYES_99 if I learn the message.  That is, running SA on
> the SAME message will give BAYES_99 after it's learned.  It's not a
> ham problem.

super, that means it works

>> you should just stop going to the urls in the spam mails; one more 
>> point is that mailscanner mangles content, which here poisons the 
>> bayes digest
> I am _NOT_ going to the URLs in the spam mail.  I'm not sure what you
> mean by that suggestion.  I know MailScanner is munging the URLs, but
> that is only for web bugs (not for links).  Also, see below.

if the content the end user sees is mangled, then the end user can't 
relearn ham to be spam

yep, the point is that if it mangles both ham and spam, then the digests 
would end up in bayes_50 :(

there is no way around that, except don't use mailscanner, or patch the 
mangling to be removed

the mangling part of mailscanner should be in an imap proxy or pop3 proxy 
so the mailserver keeps the original content as sent. if spamassassin 
added a header to help relearn bayes, like dspam does, then mailscanner 
could keep mangling like it does now, since dspam knows which digest is 
the original and which is the changed digest per email

this part does not work for spamassassin

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 11:23 PM +0200 06/30/2013, Benny Pedersen wrote:
>does it continue if one msg is learned as spam? does it still hit 
>bayes_50 after that?

No, it has BAYES_99 if I learn the message.  That is, running SA on 
the SAME message will give BAYES_99 after it's learned.  It's not a 
ham problem.

>you should just stop going to the urls in the spam mails; one more 
>point is that mailscanner mangles content, which here poisons the bayes digest

I am _NOT_ going to the URLs in the spam mail.  I'm not sure what you 
mean by that suggestion.  I know MailScanner is munging the URLs, but 
that is only for web bugs (not for links).  Also, see below.

>verify that mails are sent first to spamassassin and that mailscanner 
>mangles LAST in the chain

The way my system is set up, there is no way to get SA to run before 
MailScanner.  MailScanner has to run first.  It's not possible to 
change this without a lot of reconfiguring, unfortunately, due to the 
way the system is set up.

>it's very important that spamassassin sees the original content unmangled

We had that discussion a few weeks ago -- since MailScanner munges 
both ham and spam, it has essentially no effect on the Bayes score.

At 12:01 AM +0100 07/01/2013, RW wrote:
>The sources of the body tokens are:
>
>   $msgdata->{bayes_token_body} = 
>$msg->{msg}->get_visible_rendered_body_text_array();
>
>   $msgdata->{bayes_token_inviz} = 
>$msg->{msg}->get_invisible_rendered_body_text_array();
>
>which suggests it's rendered. The debug is consistent with this:

So are you saying Bayes won't see rawbody at all?  It just uses body? 
Or does it have tokens from both body and rawbody?

Also, what is "invisible" rendered body text?  Would this include the comments?

Even if comments are invisible to the user, they should still end up 
inside the body tags.  Consider: on every web browser, when you "view 
source," you can see comments and similar things.  They are not 
"rendered" in the sense that they're not displayed, but they are 
certainly processed by the HTML engine.  Anything within an HTML tag 
is processed, which is why you can see comments when you view source. 
It's still in the "body" ... just invisible.

Because of this, I would hope that HTML comments would end up within 
the Bayes "body" tags even if they are invisible.  Is there any way 
to verify this?  Since the debug output shows tokens, I guess one 
could make a test email, put some markers inside comments, and see if 
those markers show up in the Bayes tokenization debug output...

Thanks.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-30 23:09:

> very well.  The actual spammy content is only 5% of the message 
> (maybe
> less) and therefore doesn't "weigh" much in the Bayes analysis.

it could very well be

> because it reduces the efficacy of learning these messages, per the
> description above.

does it continue if one msg is learned as spam? does it still hit 
bayes_50 after that?

if so, lower the autolearn ham threshold so it learns less ham 
automatically
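A minimal local.cf sketch of that suggestion (the option names are the 
standard auto-learn thresholds; the values are just an example, not a 
recommendation):

  # Auto-learn as ham only below 0.05 instead of the default 0.1,
  # and leave the spam auto-learn threshold at its default.
  bayes_auto_learn_threshold_nonspam 0.05
  bayes_auto_learn_threshold_spam    12.0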

> written a "spellcheck" plugin for SA to do this?  Seems like a recipe
> for FPs, unfortunately.

perl has aspell and ispell modules, it's just still missing a plugin

> regexp similar to what John Hardin made for STYLE_GIBBERISH should
> work for this, appropriately modified for comments rather than style
> tags.

you should just stop going to the urls in the spam mails; one more 
point is that mailscanner mangles content, which here poisons the bayes 
digest

verify that mails are sent first to spamassassin and that mailscanner 
mangles LAST in the chain

it's very important that spamassassin sees the original content unmangled

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 8:57 PM +0200 06/30/2013, Benny Pedersen wrote:
>well it might confuse bayes, yes, but it can't stop you from running 
>sa-learn --spam on it?

I've been running "sa-learn --spam" on these messages for a month 
straight.  Some get picked up, others don't.  I'm still getting a lot 
of BAYES_50 on these, and I'm almost positive it's because of these 
enormous gibberish comments.  95% of the message content is this 
gibberish, and because it's random, it doesn't get picked up by Bayes 
very well.  The actual spammy content is only 5% of the message 
(maybe less) and therefore doesn't "weigh" much in the Bayes analysis.

In other words, learning these messages has far smaller effect than 
one might think it would, and I'm pretty certain one of the reasons 
the spammers are including kilobytes of gibberish text is exactly 
because it reduces the efficacy of learning these messages, per the 
description above.

>maybe one could add language checking on how many words are spelled 
>incorrectly, compared to the msg size

How's it going to figure out what's spelled incorrectly, especially 
for people who might have messages not in English?  Has someone 
written a "spellcheck" plugin for SA to do this?  Seems like a recipe 
for FPs, unfortunately.

At 11:01 PM +0200 06/30/2013, Benny Pedersen wrote:
>it does not matter what poison is in the spam mails as long as one learns it as spam

Per above, I don't think this is correct.  If 95% of the poison is 
random and changes every time, the "important" part of the poison 
doesn't weigh much in the tokenization.  I run these messages through 
sa-learn every time, and it catches a few nearly-identical messages 
because of it, but the next day, or the next week, others that LOOK 
like they should have been caught will slip by.

I don't know if there is an algorithm update to Bayes that could help 
catch this, but adding an HTML_COMMENT_GIBBERISH rule with a fairly 
high score will at least help to offset the lack of Bayes hits.  One 
doesn't need to run it through lint or tidy or what-not... I think a 
regexp similar to what John Hardin made for STYLE_GIBBERISH should 
work for this, appropriately modified for comments rather than style 
tags.
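
Just to sketch the idea (this is untested, and the rule name, regexp, and 
score are all placeholders rather than a tuned rule -- it would need 
checking against a ham corpus before use):

  # Placeholder sketch: an HTML comment opener followed by a long run
  # of plain lowercase "words" on the same raw line.
  rawbody   HTML_COMMENT_GIBBERISH  /<!--\s*(?:[a-z]{4,20}\s+){15,}/i
  describe  HTML_COMMENT_GIBBERISH  Long run of random words inside an HTML comment
  score     HTML_COMMENT_GIBBERISH  1.0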

Thanks.

						--- Amir

Re: LONGWORDS not hitting?

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi skrev den 2013-06-30 20:42:

> (And yes, I know it only hit BAYES_50... I really think these
> gibberish strings are confusing Bayes.  This is also another example
> of where an HTML COMMENT GIBBERISH rule would help. ;-) )

well it might confuse bayes, yes, but it can't stop you from running 
sa-learn --spam on it?

maybe one could add language checking on how many words are spelled 
incorrectly, compared to the msg size

-- 
senders that put my email into body content will deliver it to my own 
trashcan, so if you'd like to get a reply, don't do it

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 1:43 PM +0100 08/24/2013, RW wrote:
>LONGWORDS is a body rule, i.e. it runs on a normalized version of the

Gah, THAT'S why it wasn't working?  I feel like an idiot now. =P

						--- Amir

Re: LONGWORDS not hitting?

Posted by RW <rw...@googlemail.com>.
On Sat, 24 Aug 2013 00:23:17 -0600
Amir 'CG' Caspi wrote:

> Hi all,
> 
> 	Since it's been a couple of weeks with no reply, I thought I 
> might ask this again.  See below.
> 	Do I need to file a bug for SA?  Is this something obvious 
> that I'm missing?  Does the LONGWORDS rule need an update?

LONGWORDS is a body rule, i.e. it runs on a normalized version of the
rendered text. Neither Bayes nor LONGWORDS sees any of the words
you're looking at.

You could try writing a separate rawbody rule, but it would see all
of the HTML and not just the comments.
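
For instance, a rough rawbody sketch along those lines (untested; the 
name, pattern, and score are placeholders, and it would match long word 
runs anywhere in the raw HTML, comments or not):

  # Placeholder: ten or more consecutive plain lowercase words in the
  # raw (still-HTML, decoded) body text.
  rawbody   RAW_LONGWORDS  /(?:\b[a-z]{4,15}\s+){10,}/
  describe  RAW_LONGWORDS  Long run of plain words in the raw body
  score     RAW_LONGWORDS  0.5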

Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
Hi all,

	Since it's been a couple of weeks with no reply, I thought I 
might ask this again.  See below.
	Do I need to file a bug for SA?  Is this something obvious 
that I'm missing?  Does the LONGWORDS rule need an update?

	It appears that LONGWORDS is failing to hit on the original 
(server-side, MBOX) email with all MIME components... but hits on the 
email once it has been interpreted as text by the MUA.  Something 
about the MIME encoding is confusing LONGWORDS, even though I can't 
see why with my naked eye.
	Pastebin examples of both (server-side and MUA) versions are below.

Thanks.

						--- Amir

At 2:10 PM -0600 08/10/2013, Amir 'CG' Caspi wrote:
>At 12:42 PM -0600 06/30/2013, Amir 'CG' Caspi wrote:
>>Hi all,
>>
>>	Just got this spam:
>>
>>http://pastebin.com/KM5paaZ9
>>
>>To me, it looks like LONGWORDS should have hit... but it didn't.  I 
>>ran it manually through spamassassin and spamc, and LONGWORDS still 
>>didn't hit, so it seems to just not be hitting that rule.  But, to 
>>my eye, it looks like it should.  Any idea why it failed, and 
>>should LONGWORDS be updated?
>
>OK, more info and potentially new problem.  I re-tested one of the 
>spams I posted yesterday:
>http://pastebin.com/VCtvzjzV
>
>When running this example through SA (either SA standalone, or 
>spamc/spamd) now, LONGWORDS hits, as follows:
>
>Aug 10 15:47:20.115 [21805] dbg: rules: ran body rule __LONGWORDS_C 
>======> got hit: "authenticate dearth deplorers hogmane 
>fraudulentness going pillowcases believing vagotomy mastoidectomies "
>Aug 10 15:46:20.613 [21757] dbg: rules: ran body rule __LONGWORDS_B 
>======> got
>hit: "family husbandry allowed walloper little length voluntaries 
>weothao sternw
>ard "
>
>... BUT... this pastebin example is the copy/paste of "view raw 
>source" from my MUA.  If I run SA on the original server-side email 
>(i.e. the email as stored in my IMAP mailbox), LONGWORDS does _NOT_ 
>hit.  That is, neither _C nor _B hit on the server-side version, 
>despite hitting on the MUA version.
>
>For your perusal, I've copied the output of SA when running on the 
>server-side version, i.e. with all MIME content fully intact... see 
>here:
>
>http://pastebin.com/keNi5BjN
>
>What the heck is going on?  Why would LONGWORDS hit on the MUA 
>version but not the server-side?  Since LONGWORDS is a rawbody rule, 
>not based on headers, it seems like it should pop on both versions. 
>I'm guessing that there's something about the MIME content that's 
>making LONGWORDS fail to hit on the server-side (MBX) email, but 
>allows it to hit on the MUA ("view raw source") email... but I just 
>don't understand why that would be.
>
>I've had LONGWORDS hit at the server-side (pre-MUA) level, though 
>not very often (only 4 out of 465 messages currently in my spam 
>box), so it _is_ running... but for whatever reason, LONGWORDS hits 
>much more often (i.e. as it should) with the MUA "raw source" 
>versions than it does with server-side (MBOX/MBX) versions, so this 
>is not an isolated occurrence.
>
>So WTF is going on?  Does anyone have ideas?  To my eyeballs, the 
>exact same text is contained in both versions and therefore should 
>hit LONGWORDS in either version, but only one version pops.
>
>I'm happy to paste more debug output if it might help someone debug the rule.
>
>Thanks in advance.
>
>						--- Amir



Re: LONGWORDS not hitting?

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 12:42 PM -0600 06/30/2013, Amir 'CG' Caspi wrote:
>Hi all,
>
>	Just got this spam:
>
>http://pastebin.com/KM5paaZ9
>
>To me, it looks like LONGWORDS should have hit... but it didn't.  I 
>ran it manually through spamassassin and spamc, and LONGWORDS still 
>didn't hit, so it seems to just not be hitting that rule.  But, to 
>my eye, it looks like it should.  Any idea why it failed, and should 
>LONGWORDS be updated?

OK, more info and potentially new problem.  I re-tested one of the 
spams I posted yesterday:
http://pastebin.com/VCtvzjzV

When running this example through SA (either SA standalone, or 
spamc/spamd) now, LONGWORDS hits, as follows:

Aug 10 15:47:20.115 [21805] dbg: rules: ran body rule __LONGWORDS_C 
======> got hit: "authenticate dearth deplorers hogmane 
fraudulentness going pillowcases believing vagotomy mastoidectomies "
Aug 10 15:46:20.613 [21757] dbg: rules: ran body rule __LONGWORDS_B ======> got
hit: "family husbandry allowed walloper little length voluntaries 
weothao sternw
ard "

... BUT... this pastebin example is the copy/paste of "view raw 
source" from my MUA.  If I run SA on the original server-side email 
(i.e. the email as stored in my IMAP mailbox), LONGWORDS does _NOT_ 
hit.  That is, neither _C nor _B hit on the server-side version, 
despite hitting on the MUA version.

For your perusal, I've copied the output of SA when running on the 
server-side version, i.e. with all MIME content fully intact... see 
here:

http://pastebin.com/keNi5BjN

What the heck is going on?  Why would LONGWORDS hit on the MUA 
version but not the server-side?  Since LONGWORDS is a rawbody rule, 
not based on headers, it seems like it should pop on both versions. 
I'm guessing that there's something about the MIME content that's 
making LONGWORDS fail to hit on the server-side (MBX) email, but 
allows it to hit on the MUA ("view raw source") email... but I just 
don't understand why that would be.

I've had LONGWORDS hit at the server-side (pre-MUA) level, though not 
very often (only 4 out of 465 messages currently in my spam box), so 
it _is_ running... but for whatever reason, LONGWORDS hits much more 
often (i.e. as it should) with the MUA "raw source" versions than it 
does with server-side (MBOX/MBX) versions, so this is not an isolated 
occurrence.

So WTF is going on?  Does anyone have ideas?  To my eyeballs, the 
exact same text is contained in both versions and therefore should 
hit LONGWORDS in either version, but only one version pops.

I'm happy to paste more debug output if it might help someone debug the rule.

Thanks in advance.

						--- Amir