Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/05/02 13:27:58 UTC

[Bug 4892] New: The abbreviation for Oxfordshire causes high Spam Score

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4892

           Summary: The abbreviation for Oxfordshire causes high Spam Score
           Product: Spamassassin
           Version: 3.1.0
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: major
          Priority: P5
         Component: Score Generation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: vicky.clarke@digi-products.com
                CC: vicky.clarke@digi-products.com


Hello! The word o x o n (no spaces!) triggers the FUZZY_XPILL body rule in the
spam score.
SpamAssassin gives any email containing the word O x o n (no spaces - I just
don't want this to get spammed too!) a high score of 4.1, whereas if I remove
that one word the score drops to 1.4! It's a very common word, being short for
the county of Oxfordshire as well as the animal, so I wondered if you could
stop this happening please?
SpamAssassin Report for mail with o x o n in:
Wed 2006-04-26 17:15:57: Spam Filter processing c:\mdaemon\localq\md50001193349.msg...
Wed 2006-04-26 17:15:57: > Message return-path: info@sportsworld.digi-email.com
Wed 2006-04-26 17:15:57: > Message from: info@sportsworld.digi-email.com
Wed 2006-04-26 17:15:57: > Message to: vicky.clarke@digi-products.com
Wed 2006-04-26 17:15:57: > Message subject: fmanager
Wed 2006-04-26 17:15:57: > Message ID: <CH...@web1>
Wed 2006-04-26 17:15:57: Start SpamAssassin results
Wed 2006-04-26 17:15:57: 4.10 points, 3.00 required
Wed 2006-04-26 17:15:57: *  0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
Wed 2006-04-26 17:15:57: *  2.6 FUZZY_XPILL BODY: Attempt to obfuscate words in spam
Wed 2006-04-26 17:15:57: *  0.1 HTML_TAG_EXIST_TBODY BODY: HTML has "tbody" tag
Wed 2006-04-26 17:15:57: *  0.0 HTML_MESSAGE BODY: HTML included in message
Wed 2006-04-26 17:15:57: *  0.3 HTML_FONT_BIG BODY: HTML tag for a big font size
Wed 2006-04-26 17:15:57: *  0.2 MIME_BOUND_NEXTPART Spam tool pattern in MIME boundary
Wed 2006-04-26 17:15:57: End SpamAssassin results
Wed 2006-04-26 17:15:57: * c:\mdaemon\localq\md50001193349.msg deleted

Without the word o x o n the FUZZY_XPILL BODY doesn't appear in results.
Thanks very much - please email or call on +44 (0)1189 841567 if you need more 
info!



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4892] The abbreviation for Oxfordshire causes high Spam Score

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4892





------- Additional Comments From jm@jmason.org  2007-03-15 05:50 -------
testing replacement rule now








------- Additional Comments From jm@jmason.org  2007-10-08 10:43 -------
(In reply to comment #19)
> testing replacement rule now

the original isn't stellar these days, but the replacement certainly isn't
working too well:

MSECS    SPAM%                           HAM%                           S/O    RANK  SCORE  NAME
0.00000  0.1018  796 of 781638 messages  0.0129  21 of 162569 messages  0.887  0.65  3.40   FUZZY_XPILL
0.00000  0.0000  0 of 781638 messages    0.0000  0 of 162569 messages   0.500  0.48  0.01   T_FUZZY_XPILL_BUG4892
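
For anyone reading the two rows above, here is a rough sketch of how the
percentage and S/O figures relate to the raw hit counts. It assumes S/O is
the spam hit rate divided by the sum of the spam and ham hit rates, and that
a rule with no hits at all is reported as a neutral 0.500; the function name
hit_stats is made up for this illustration and is not part of any
SpamAssassin tool.

```python
def hit_stats(spam_hits, spam_total, ham_hits, ham_total):
    """Return (spam %, ham %, S/O) for one rule from raw hit counts."""
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    if spam_pct + ham_pct == 0:
        # Assumption: a rule that never fires is shown as neutral (0.500).
        so = 0.5
    else:
        so = spam_pct / (spam_pct + ham_pct)
    return spam_pct, ham_pct, so


# Hit counts copied from the two rows in this comment.
rows = {
    "FUZZY_XPILL": (796, 781638, 21, 162569),
    "T_FUZZY_XPILL_BUG4892": (0, 781638, 0, 162569),
}
for name, counts in rows.items():
    spam_pct, ham_pct, so = hit_stats(*counts)
    print(f"{name}: spam={spam_pct:.4f}% ham={ham_pct:.4f}% S/O={so:.3f}")
```

Under those assumptions the first row works out to 0.1018% of spam, 0.0129%
of ham and an S/O of 0.887, while the replacement rule never fires on either
corpus, so it removes the false positives only by catching nothing at all.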









------- Additional Comments From nj@leverton.org  2006-08-31 14:42 -------
Created an attachment (id=3678)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3678&action=view)
Suggested fix (apply \b to the pattern).









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:30 -------
Created an attachment (id=3504)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3504&action=view)
Jpeg required for htm mail previously attached

Would be much easier if you allowed upload of zip/rar files!








------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:33 -------
Created an attachment (id=3509)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3509&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:33 -------
Created an attachment (id=3510)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3510&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From nj@leverton.org  2006-08-31 14:39 -------
Created an attachment (id=3677)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3677&action=view)
Example causing FP (text obfuscated for customer privacy).









------- Additional Comments From sidney@sidney.com  2006-08-31 14:31 -------
I'm pasting in the following suggestion that Nick Leverton made in bug 5075 so
it doesn't get lost in this discussion. Can someone test the effect of this
suggestion on some corpora?

 --------------------

The simplest fix seems to be to add \b to the rule as follows:     
     
body FUZZY_XPILL        /<inter W3><post P2>(?!xanax)\b<X><A><N><A><X>/i
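
A rough Python sketch of what the added \b is meant to achieve. The character
classes and the gap pattern below are simplified, assumed stand-ins for the
<X>, <A>, <N> and <inter W3> replace tags (the real definitions live in the
SpamAssassin rule files and are not reproduced here), and <post P2> is left
out entirely, so this only illustrates the word-boundary idea and is not a
port of the rule.

```python
import re

# Assumed, simplified expansions of the replace tags (not the real ones).
X, A, N = r"[xX]", r"[aAoO0@]", r"[nN]"
GAP = r"[\W_]{0,3}"  # assumed stand-in for <inter W3>
CORE = GAP.join([X, A, N, A, X])

before = re.compile(r"(?!xanax)" + CORE, re.IGNORECASE)
after = re.compile(r"(?!xanax)\b" + CORE, re.IGNORECASE)

samples = [
    "Abingdon, Oxon. OX14 3JF",  # legitimate address
    "cheap x@n a x here",        # obfuscated spelling
]
for text in samples:
    hit_before = bool(before.search(text))
    hit_after = bool(after.search(text))
    print(f"{text!r}: before={hit_before} after={hit_after}")
```

In this sketch the address only matches without \b, because the 'x' in
"Oxon" follows another word character and so is not at a word boundary. The
obfuscated sample starts its 'x' after a space and matches either way.
Whether the real rule behaves the same still needs checking against corpora.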









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:26 -------
Created an attachment (id=3502)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3502&action=view)
Htm file which causes the error when the file is sent via chilkat from our server.

I'm not sure if I explained fully that these htm files are generated on our
server via asp and then mailed via chilkat. The SpamAssassin scores are
generated as the mail is processed by MDaemon on our mail server.
The attached mail had a score of 4.1 when containing the word O x o n but
without it I got 1.4. 
Thanks for looking into this - please let me know if you need any further info.








------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:31 -------
Created an attachment (id=3506)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3506&action=view)
Jpeg required for htm mail previously attached






sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nj@leverton.org




------- Additional Comments From sidney@sidney.com  2006-08-31 14:25 -------
*** Bug 5075 has been marked as a duplicate of this bug. ***








------- Additional Comments From vicky.clarke@digi-products.com  2006-06-02 11:04 -------
(In reply to comment #12)
> The problem is not the word Oxon by itself. The example contains an address
> that has the lines
> Abingdon,
> Oxon.
> OX14 3JF.
> which in HTML look like
> Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>
> When the HTML tags are removed to process the text in the body, the string
> Oxon.
> OX14
> is a fuzzy match for 'xanax' in the FUZZY_XPILL_BODY rule. The initial 'O' and
> the final '14' are ignored, as are newlines and spaces, leaving xon.OX as what
> is fuzzily matching with 'xanax'.
> Whether that should be a match I leave for someone with more familiarity with
> the fuzzy match rules to decide now that the problem has been narrowed down.
> I do wonder if <br> should be replaced with a newline and the fuzzy match
> should not go across lines, if that is possible with the way we parse out text
> from HTML.

Hi there - thank you for looking into this. I'm sorry, I didn't get the email
regarding your finding because any mail containing <BR>Oxon.<BR>OX14 would have
been marked as spam! Our spam is deleted, you see. Does that mean any business
in that Oxfordshire postcode sending HTML mails with their address and the
abbreviation Oxon will get their mail marked as spam?
Would you be able to let me know when you will have a fix for this please? Or
is there a workaround I can use for now?
Thank you very much!
Vicky






sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #3502|text/html                   |text/plain
          mime type|                            |




------- Additional Comments From sidney@sidney.com  2006-05-03 11:20 -------
(From update of attachment 3502)
changing mime type of attachment to text/plain to make it easier to view in
bugzilla









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:32 -------
Created an attachment (id=3507)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3507&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From sidney@sidney.com  2006-05-03 11:45 -------
The problem is not the word Oxon by itself. The example contains an address that
has the lines

Abingdon,
Oxon.
OX14 3JF.

which in HTML look like

Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>

When the HTML tags are removed to process the text in the body, the string

Oxon.
OX14

is a fuzzy match for 'xanax' in the FUZZY_XPILL_BODY rule. The initial 'O' and
the final '14' are ignored, as are newlines and spaces, leaving xon.OX as what
is fuzzily matching with 'xanax'.

Whether that should be a match I leave for someone with more familiarity with
the fuzzy match rules to decide now that the problem has been narrowed down.

I do wonder if <br> should be replaced with a newline and the fuzzy match should
not go across lines, if that is possible with the way we parse out text from HTML.
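
The effect described above can be reproduced with a small Python sketch. The
character classes and gap pattern are simplified, assumed stand-ins for the
replace tags used by the rule (in particular, treating 'o' as a look-alike
for 'a' is an assumption made so that the match reported here shows up), and
strip_tags is a crude, hypothetical tag stripper rather than SpamAssassin's
own HTML rendering.

```python
import re

# Assumed, simplified stand-ins for the <X>, <A>, <N> replace tags.
X = r"[xX]"
A = r"[aAoO0@]"  # assumes 'o' and '0' count as look-alikes for 'a'
N = r"[nN]"
GAP = r"[\W_]{0,3}"  # assumed gap: up to 3 non-word characters or underscores

# Unanchored pattern in the spirit of the rule discussed in this bug.
FUZZY = re.compile(r"(?!xanax)" + GAP.join([X, A, N, A, X]), re.IGNORECASE)


def strip_tags(html, br_as_newline=False):
    """Crude HTML-to-text: optionally turn <br> into a newline, drop tags."""
    if br_as_newline:
        html = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    return re.sub(r"<[^>]+>", "", html)


html = "Abingdon,<BR>Oxon.<BR>OX14 3JF.<BR>"
for as_newline in (False, True):
    text = strip_tags(html, as_newline)
    hit = FUZZY.search(text)
    print(repr(text), "->", repr(hit.group(0)) if hit else None)
```

With these assumptions the pattern matches the run starting at the 'x' of
"Oxon" and ending at the 'X' of "OX14" in both cases. Turning <br> into a
newline does not help on its own, because the newline is just another
non-word character that the gap pattern is allowed to skip.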









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:29 -------
Created an attachment (id=3503)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3503&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From felicity@apache.org  2006-05-02 13:45 -------
I can't reproduce this. Please attach (via the web form) a sample mail which
has this problem.








------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:31 -------
Created an attachment (id=3505)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3505&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From vicky.clarke@digi-products.com  2006-05-03 08:32 -------
Created an attachment (id=3508)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3508&action=view)
Jpeg required for htm mail previously attached









------- Additional Comments From nj@leverton.org  2006-08-31 14:46 -------
(In reply to comment #12) 
> I do wonder if <br> should be replaced with a newline and the fuzzy match
> should not go across lines, if that is possible with the way we parse out text
> from HTML.
 
The Post Office recommends that postcodes follow the county on the same line, 
although many people do split it onto two lines as in the OP's example.  My 
example though shows them both on the same line. 
 


