Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/05/26 00:07:08 UTC

[Bug 6119] New: TVD_SPACE_RATIO FP mail as wanted by JM

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

           Summary: TVD_SPACE_RATIO FP mail as wanted by JM
           Product: Spamassassin
           Version: 3.2.5
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: apache.org-sa-bugs@zmi.at


Justin said: please attach FPs you can share to tickets on bugzilla.  they do
help.

So here goes. I found an old mail, but that one is already in my masscheck_HAM
list and should already have been learned. Still, I got a report last week about
a legit mail that was an FP, and this rule was among those that hit. Sorry, I
can't share that mail. But I can share this one:

Received: by mailsrv1.zmi.at (Postfix, from userid 65534)       id 9D69F166ED;
        Wed, 25 Jul 2007 21:12:46 +0200 (CEST)
Received: from protegate5.zmi.at (protegate5.zmi.at [212.69.162.205])   (using
TLSv1
        with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))     (Client CN
"protegate1.zmi.at", Issuer "power4u.zmi.at" (not verified))
        by mailsrv1.zmi.at (Postfix) with ESMTP id 2FB5527EC
        for <OR...@DATAMATIX.AT>; Wed, 25 Jul 2007 21:12:46 +0200 (CEST)
X-Envelope-From: m12720600880x1@orbcomm2.net
Received: from localhost (localhost [127.0.0.1])
        by protegate5.zmi.at (Postfix) with ESMTP id 7BE451E0EF
        for <OR...@DATAMATIX.AT>; Wed, 25 Jul 2007 21:10:51 +0200 (CEST)
X-Virus-Scanned: amavisd-new at zmi.at
X-Spam-Score: 4.636
X-Spam-Level: ****
X-Spam-Status: No, score=4.636 tagged_above=-999 required=5
        tests=[AWL=0.186,
        BAYES_20=-0.74, DKIM_POLICY_SIGNSOME=0, DK_POLICY_SIGNSOME=0,
        FROM_LOCAL_DIGITS=0.001, FROM_LOCAL_HEX=1.2, INVALID_MSGID=1.9,
        L_P0F_UNKN=0.1, SPF_PASS=-0.001, TVD_SPACE_RATIO=1.99]
Received: from protegate5.zmi.at ([127.0.0.1])
        by localhost (protegate5.zmi.at [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id AVfBWtBEdgkn for <OR...@DATAMATIX.AT>;
        Wed, 25 Jul 2007 21:10:47 +0200 (CEST)
X-Envelope-From: m12720600880x1@orbcomm2.net
Received: from fw1.orbcomm2.net (fw1.orbcomm2.net [208.44.94.213])
        by protegate5.zmi.at (Postfix) with ESMTP id 6C6721E0E7
        for <OR...@DATAMATIX.AT>; Wed, 25 Jul 2007 21:10:46 +0200 (CEST)
Received: from omsseh (omss.orbcomm2.net [10.201.26.11])
        by fw1.orbcomm2.net (8.9.3/8.9.3p2.fw2525 (GMSS3.3E Internal)) with
SMTP
        id TAA02634     for <OR...@DATAMATIX.AT>; Wed, 25 Jul 2007 19:10:44
GMT
From: M12720600880X1@ORBCOMM2.NET
Date: 25 Jul 2007 19:10:44 +0000
Message-ID: <"00020202 0001 46a7a034"* @MHS>
Subject: [GLOBALGRAM:SAT=35]
To: orbcom10@datamatix.at
Priority: non-urgent
X-sat_id: 35
X-ncc_id: 120
X-ncc_mha_ref: 1
X-ack_level: 0
X-DBMail-PhysMessage-ID: 610116
Return-Path: M12720600880X1@ORBCOMM2.NET
MIME-Version: 1.0
Content-Type: text/plain;
  charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
X-DBMail-PhysMessage-ID: 610117
X-Length: 2235
X-Original-X-UID: 1202527
X-UID: 8115

07-07-25,14:12:40,48.192509,16.349321,278.338989,132.710999,0.000000,12,73909,"STD"


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #2 from Michael Monnerie <ap...@zmi.at>  2009-05-25 22:59:15 PST ---
Created an attachment (id=4451)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4451)
Lots of FPs, but all of the same type of e-mail

Sorry, but I have no other FPs to provide at the moment; I will report as soon
as I have the next FP. Maybe I'll reset the score to the default, then I should
quickly get a report. But that's nasty...



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #31 from Karsten Bräckelmann <gu...@rudersport.de> 2010-03-23 17:42:37 UTC ---
Moving back off of Security, which got changed by accident during the mass
Target Milestone move.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #9 from mouss <mo...@netoyen.net>  2009-05-26 23:18:09 PST ---
(In reply to comment #8)
> (In reply to comment #7)
> > This line for example would have a ratio of 18% on its own, still 13% with the
> > longish header-style prefix and no (munged?) linebreak.
> > 
> >   Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)
> 
> Oops, underscores added by me. They are actually spaces.
> 

Remove the signature from attachment 4454 and the rule doesn't hit anymore.

Maybe there's a way to ignore "simple" signatures? 



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #13 from Michael Monnerie <ap...@zmi.at>  2009-05-28 01:17:32 PST ---
> The first message, DBMail ID 610202, a human generated short note, is NOT a FP.
> TVD_SPACE_RATIO just doesn't fire on that one. The note itself has a ratio of
> 19 (11/57). Just as expected. Michael, why did you put that one in?

I must express my deepest apology for that mistake, sire! The error must have
come from me being human, having too much work, and being full of mistakes. I
will cut my left small finger off to remember the shame I brought to all of my
family, and remind me to be more careful for all times.

mfg zmi



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5                          |P2




--- Comment #21 from Justin Mason <jm...@jmason.org>  2009-07-23 07:11:19 PST ---
going to take a look at these


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #17 from Warren Togami <wt...@redhat.com>  2009-07-17 05:55:30 PST ---
http://ruleqa.spamassassin.org/20090716-r794596-n/TVD_SPACE_RATIO/detail?s_corpus=1#corpus

Yikes, my statistics are very different from the other contributors.

I had noticed that simple mail containing only "test" in the body triggers this
rule.

I'll find a few of the others and attach them here.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #22 from Justin Mason <jm...@jmason.org>  2009-08-31 16:06:01 PST ---
if we want to change this for 3.3.0, it needs to be in SVN by this Thursday;
see bug 6155.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #6 from mouss <mo...@netoyen.net>  2009-05-26 12:58:42 PST ---
Created an attachment (id=4455)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4455)
and one from freebsd-current@freebsd.org



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #14 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-28 13:03:26 PST ---
> I must express my deepest apology for that mistake, sire!

No need to. :)  It serves perfectly as an example that TVD_SPACE_RATIO indeed
tends not to pick on real, human-written text.

The FP samples so far are all trivial to rescue, and IMHO probably shouldn't
have been fed to SA in the first place. Given the current S/O in mass-check,
however, I do agree the score is seriously too high.

Candidate for sa-update.

Since the Target Milestone is set to 3.3.0, the score issue most likely will be
resolved magically by a full GA run. What I'd really like to sort out for 3.3.0
is whether the eval really still works as originally intended, or whether it
maybe got broken by evaluating paragraphs instead of lines and by the injected
trailing space.



[Bug 6119] [review] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hege@hege.li
            Summary|TVD_SPACE_RATIO false       |[review] TVD_SPACE_RATIO
                   |positives -- the FP         |false positives -- the FP
                   |Collection                  |Collection
  Status Whiteboard|                            |needs 2 votes to disable
                   |                            |rule

--- Comment #32 from Henrik Krohns <he...@hege.li> 2011-05-09 12:23:24 UTC ---
Mass-checks look abysmal. This rule seems to be one that can never be fixed for
all cases.

Anyone in favor of score 0.001? +1 from me


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #8 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-26 15:10:59 PST ---
(In reply to comment #7)
> This line for example would have a ratio of 18% on its own, still 13% with the
> longish header-style prefix and no (munged?) linebreak.
> 
>   Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)

Oops, underscores added by me. They are actually spaces.



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #20 from Justin Mason <jm...@jmason.org>  2009-07-21 15:51:27 PST ---
(In reply to comment #19)
> http://ruleqa.spamassassin.org/20090721-r796186-n/TVD_SPACE_RATIO/detail
> 
> Why does TVD_SPACE_RATIO have such a high score of 2.90?  It appears to have a
> very low spam hit % and high false positives.

Warren, generally those rules did better when they were first added... often
there was a spam run using those characteristics.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
            Summary|[review] TVD_SPACE_RATIO    |TVD_SPACE_RATIO false
                   |false positives -- the FP   |positives -- the FP
                   |Collection                  |Collection
  Status Whiteboard|needs 2 votes to disable    |
                   |rule                        |

--- Comment #33 from Henrik Krohns <he...@hege.li> 2011-05-09 12:25:14 UTC ---
(In reply to comment #32)
> Mass-checks look abysmal. This rule seems to be one that can never be fixed for
> all cases.
> 
> Anyone in favor of score 0.001? +1 from me

Never mind, it was already set to that. :-D Closing.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

--- Comment #25 from Matus UHLAR - fantomas <uh...@fantomas.sk> 2009-11-13 01:46:32 UTC ---
note that modifying TVD_SPACE_RATIO could positively affect its score too ;)


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2                          |P5
                 CC|                            |jm@jmason.org





[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119


Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|TVD_SPACE_RATIO FP mail as  |TVD_SPACE_RATIO false
                   |wanted by JM                |positives -- the FP
                   |                            |Collection




--- Comment #1 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-25 18:15:32 PST ---
(In reply to comment #0)
> please attach FPs you can share to tickets on bugzilla.  they do help.
         ^^^^^^
Yes, indeed. Pretty please with sugar on top, *attach* the raw messages, do
*not* paste them as a comment. See "Add an attachment" in the Attachment list
below the details table, above the comments.  Thanks. :)



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

--- Comment #26 from Justin Mason <jm...@jmason.org> 2009-11-13 06:27:46 UTC ---
(In reply to comment #24)
> Do you need a sample? Unfortunately I can't just attach a customer's e-mail.

can you come up with something similar (but shareable) which triggers the
issue?


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #7 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-26 14:58:30 PST ---
Just had a quick look at attachment 4452 and the BodyEval tvd_vertical_words()
function, adding some noisy debugging love. The reason is quite simple -- the
space to non-space ratio doesn't exceed 9%, which is less than the default 10%
max.

This didn't become apparent from looking at the code only without the
debugging, though. I expected it to check the body line by line. However, it
actually checks the space ratio for *paragraphs* in a traditional UN*X style.
That paragraph ends with *two* newlines.

This line for example would have a ratio of 18% on its own, still 13% with the
longish header-style prefix and no (munged?) linebreak.

  Over_to_maintainer_(via_the_GNATS_Auto_Assign_Tool)

The text being looked at is the entire paragraph, though, including all lines
immediately preceding or following without an empty line. This results in
20/201, or about 9%. One reason, and an explanation of why the rule loves to hit
on such messages, is the very long words prefixing each line. Or, in other
words: There's not much real, human-generated text there. Compare it to this
very paragraph...

A quick and easy fix is to lower the max threshold (second argument) in
20_body_tests.cf, which currently reads:
  body TVD_SPACE_RATIO  eval:tvd_vertical_words('0','10')

However, given the idea is to identify lots of *vertical* words, I seriously
wonder if this used to work on actual *lines*, rather than whole paragraphs.
Theo?
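
The line-versus-paragraph difference can be illustrated with a minimal Python
sketch. It is only a model of the behaviour described in this comment, not the
Perl code in BodyEval.pm. The function name space_ratio, the joining of lines
with single spaces, and the first three sample lines are assumptions made up
for illustration; only the last line is the one quoted from attachment 4452.

```python
def space_ratio(text: str) -> int:
    """Spaces as a percentage of non-space characters, truncated to an int."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    if nonspaces == 0:
        return 100  # assumption: treat an all-space string as 100%
    return 100 * spaces // nonspaces


MAX_PCT = 10  # second argument of tvd_vertical_words('0','10')

# Hypothetical paragraph in the style of an auto-assign notification.
lines = [
    "Synopsis: ports/example: hypothetical_port_update_request",
    "Responsible-Changed-From-To: freebsd-ports-bugs->maintainer",
    "Responsible-Changed-By: someone",
    "Over to maintainer (via the GNATS Auto Assign Tool)",
]

line_ratio = space_ratio(lines[-1])             # the readable line on its own
paragraph_ratio = space_ratio(" ".join(lines))  # the whole paragraph at once

print("line alone:", line_ratio, "below max:", line_ratio < MAX_PCT)
print("paragraph: ", paragraph_ratio, "below max:", paragraph_ratio < MAX_PCT)
```

Under these assumptions the single line stays well above the max, while the
paragraph as a whole drops below it because of the long prefix tokens.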



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #4 from mouss <mo...@netoyen.net>  2009-05-26 12:55:28 PST ---
Created an attachment (id=4453)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4453)
another "sample" (it differs slightly from the one I've attached before)



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

--- Comment #24 from Matus UHLAR - fantomas <uh...@fantomas.sk> 2009-11-13 01:37:20 UTC ---
I see a match when mail of type application/pdf is sent - the pdf isn't an
attachment, it's the body. The mail contains the one required blank line between
headers and body and two blank lines at the end - might this be the problem?

It seems to happen at some fax2mail gateways.
Do you need a sample? Unfortunately I can't just attach a customer's e-mail.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #23 from Justin Mason <jm...@jmason.org>  2009-09-03 14:02:41 PST ---
I've copied those mails into my corpus, and added the fix to avoid hitting on
single-word mails.  I can't see an easy rule fix to avoid the other FP samples
here.  Hopefully the GA will reduce its score.  

fwiw, bug 6155's test scoregen resulted in:

+score TVD_SPACE_RATIO 1.291 0.598 1.799 0.744 # n=2
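
For readers unfamiliar with the four-column score line: SpamAssassin's "score"
directive accepts four values, one per score set, selected by whether Bayes and
network tests are enabled. The small Python sketch below just splits the line
above into those four sets. The helper name parse_score_line and the label
strings are hypothetical, chosen for illustration.

```python
# Order of the four columns in a four-value "score" line.
SCORE_SETS = (
    "set 0: no Bayes, no network tests",
    "set 1: no Bayes, network tests",
    "set 2: Bayes, no network tests",
    "set 3: Bayes and network tests",
)


def parse_score_line(line: str):
    """Split a four-value score line into (rule name, {score set: score})."""
    # Drop a trailing comment and a leading diff marker, then tokenize.
    tokens = line.split("#", 1)[0].lstrip("+").split()
    if len(tokens) != 6 or tokens[0] != "score":
        raise ValueError("expected: score NAME s0 s1 s2 s3")
    name = tokens[1]
    scores = {label: float(value) for label, value in zip(SCORE_SETS, tokens[2:])}
    return name, scores


name, scores = parse_score_line(
    "+score TVD_SPACE_RATIO 1.291 0.598 1.799 0.744 # n=2"
)
for label, value in scores.items():
    print(f"{name}  {label}: {value}")
```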


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #28 from Justin Mason <jm...@jmason.org> 2010-01-27 03:16:18 UTC ---
reassigning, too


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119


Matus UHLAR - fantomas <uh...@fantomas.sk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uhlar@fantomas.sk





[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #16 from Pim van den Berg <pi...@nethuis.nl>  2009-07-17 04:12:23 PST ---
I have a little addition to this bugreport. I think it is better to have an
initial value like:

$pms->{tvd_vertical_words} = -1;

on line 198 of BodyEval.pm instead of

$pms->{tvd_vertical_words} = 0;

That way the subroutine doesn't return true (0 >= $min) if there are no matches
(i.e. if there are no @lines over 5 chars in length).

For example when I send an e-mail with just a single word in the body like
'test', it probably shouldn't match TVD_SPACE_RATIO.
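
The effect of the initial value can be shown with a rough Python stand-in for
the Perl logic. This is a sketch under assumptions: rule_hits and space_ratio
are hypothetical names, and taking the highest ratio over all paragraphs is
inferred from the discussion in this bug rather than copied from BodyEval.pm.

```python
def space_ratio(text: str) -> int:
    """Spaces as a percentage of non-space characters, truncated."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    return 100 if nonspaces == 0 else 100 * spaces // nonspaces


def rule_hits(paragraphs, min_pct=0, max_pct=10, initial=0):
    """Return True if the highest measured ratio lies in [min_pct, max_pct)."""
    best = initial
    for paragraph in paragraphs:
        if len(paragraph) <= 5:  # only text over 5 characters is measured
            continue
        best = max(best, space_ratio(paragraph))
    return min_pct <= best < max_pct


single_word_body = ["test"]
print(rule_hits(single_word_body, initial=0))   # True: nothing measured, 0 >= min
print(rule_hits(single_word_body, initial=-1))  # False: -1 is below min
```

With -1 as the starting value, a body that has no measurable text can no longer
satisfy the minimum, while a long paragraph without spaces still produces a
ratio of 0 and is treated as before.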


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #11 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-27 03:17:39 PST ---
Attachment 4454. Two paragraphs, the Subject and the entire body including that
mailing-list generated sig, err footer. Ratio of 32 / 321 ~= 0.0997, which
results in 9. Separating the footer from the body (the human generated text)
results in 18 and 5 respectively. Well above the max.

The *cough* human generated body in attachment 4455 doesn't get above a ratio
of 8. Plain paste of a stacktrace and some similar stuff.

My question in comment 7 remains.
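
As a worked example of why 0.0997 "results in 9" rather than 10: the percentage
appears to be truncated to an integer before it is compared with the max. The
Python lines below only redo that arithmetic; reading 321 as the non-space
count follows the "space to non-space ratio" wording in comment 7 and is an
assumption.

```python
spaces, nonspaces = 32, 321  # figures quoted above for attachment 4454
max_pct = 10                 # max from tvd_vertical_words('0','10')

exact = 100 * spaces / nonspaces  # just under 10 percent
truncated = int(exact)            # truncation, not rounding
rounded = round(exact)            # what rounding would have given

print(f"exact={exact:.2f} truncated={truncated} rounded={rounded}")
print("hits with truncation:", truncated < max_pct)
print("hits with rounding:  ", rounded < max_pct)
```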



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #12 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-27 03:55:22 PST ---
The second mail in attachment 4451, DBMail ID 610117 (same as the paste in
comment 0) got 0/20 spaces for the Subject, and 1/83 spaces in the body. Ratio
of 1, though I kind of would have expected a 0, since the body doesn't actually
have any space.

The first message, DBMail ID 610202, a human generated short note, is NOT a FP.
TVD_SPACE_RATIO just doesn't fire on that one. The note itself has a ratio of
19 (11/57). Just as expected. Michael, why did you put that one in?


It does show some odd results, though. A paragraph seems to always be ended
with a space internally, even though there is no space in the mail and the
single-word paragraph is the very last content, merely followed by a newline
that ends the message.

That appended space means that tvd_vertical_words() actually does not identify
vertical words, as long as the average length of words in a "paragraph" is not
greater than 10. Like...

Viagra

That one will exonerate the entire message from triggering TVD_SPACE_RATIO. As
would this.

VIAGRA
and
CIALIAS
only
$1.2
per_pill

Seems hardly intended. So, again, was this actually meant to be evaluated per
paragraph or line by line? And an additional question: Was the trailing space
added by SA internally intended to be counted?
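
The exoneration effect can be reproduced with a small Python model. It is a
sketch of the behaviour described in this comment, not the SpamAssassin code:
rule_hits, space_ratio and the trailing_space switch are hypothetical, and
joining the lines of a paragraph with single spaces is an assumption.

```python
def space_ratio(text: str) -> int:
    """Spaces as a percentage of non-space characters, truncated."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    return 100 if nonspaces == 0 else 100 * spaces // nonspaces


def rule_hits(paragraphs, min_pct=0, max_pct=10, trailing_space=True):
    """True if the highest paragraph ratio lies in [min_pct, max_pct)."""
    best = 0
    for paragraph in paragraphs:
        # Model the space that seems to be appended to every paragraph.
        text = paragraph + " " if trailing_space else paragraph
        if len(text) <= 5:
            continue
        best = max(best, space_ratio(text))
    return min_pct <= best < max_pct


# Body of the mail pasted in comment 0.
data_line = ('07-07-25,14:12:40,48.192509,16.349321,278.338989,'
             '132.710999,0.000000,12,73909,"STD"')
# The vertical-word example, as one paragraph with newlines turned into spaces.
vertical = "VIAGRA and CIALIAS only $1.2 per_pill"

print(space_ratio("Viagra "))                                  # 16
print(rule_hits([data_line]))                                  # True
print(rule_hits([data_line, "Viagra"]))                        # False
print(rule_hits([data_line, "Viagra"], trailing_space=False))  # True
print(rule_hits([vertical]))                                   # False
```

In this model a single short word pushes the highest ratio over the max as soon
as the trailing space is counted, which is the effect questioned above.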



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.3.0




--- Comment #10 from Justin Mason <jm...@jmason.org>  2009-05-27 01:52:05 PST ---
for reference, the ruleqa page:
http://ruleqa.spamassassin.org/20090526-r778623-n/TVD_SPACE_RATIO/detail

and the FP rates on our dev corpora:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME             WHO/AGE
0.00000   1.2180   0.3403   0.782    0.70    2.90  TVD_SPACE_RATIO
0.00000   1.7544   0.0646   0.964    0.82    2.90  TVD_SPACE_RATIO  bb-jm
0.00000   1.1575   0.1178   0.908    0.82    2.90  TVD_SPACE_RATIO  dos
0.00000   1.5153   0.4726   0.762    0.53    2.90  TVD_SPACE_RATIO  jm
0.00000   0.1232   0.8763   0.123    0.35    2.90  TVD_SPACE_RATIO  zmi
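
For reading the S/O column: the figures above are consistent with S/O being the
spam hit rate divided by the sum of the spam and ham hit rates, so 1.0 means
"only hits spam" and values below 0.5 mean the rule hits ham more often. The
Python check below recomputes the column from SPAM% and HAM%; the formula is
inferred from these numbers, and the "(all)" label for the first row is an
assumption.

```python
# (who, SPAM%, HAM%, reported S/O) taken from the table above.
rows = [
    ("(all)", 1.2180, 0.3403, 0.782),
    ("bb-jm", 1.7544, 0.0646, 0.964),
    ("dos",   1.1575, 0.1178, 0.908),
    ("jm",    1.5153, 0.4726, 0.762),
    ("zmi",   0.1232, 0.8763, 0.123),
]


def s_o(spam_pct: float, ham_pct: float) -> float:
    """Share of the rule's hit rate that falls on spam."""
    return spam_pct / (spam_pct + ham_pct)


for who, spam, ham, reported in rows:
    print(f"{who:6} computed={s_o(spam, ham):.3f} reported={reported:.3f}")
```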



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #19 from Warren Togami <wt...@redhat.com>  2009-07-21 09:58:14 PST ---
http://ruleqa.spamassassin.org/20090721-r796186-n/TVD_SPACE_RATIO/detail

Why does TVD_SPACE_RATIO have such a high score of 2.90?  It appears to have a
very low spam hit % and high false positives.


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119

Volodin Arkady <r0...@truehosting.biz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |r00t@truehosting.biz

--- Comment #29 from Volodin Arkady <r0...@truehosting.biz> 2010-01-28 00:27:55 UTC ---
Generally, in my organization's experience, TVD_SPACE_RATIO is always an FP if
the mail was already checked by spamassassin (e.g. at the sending MTA or at a
relay MTA). I think it is because of the long runs of spaces in the
X-Spam-Report header. I think the developers should remove such system headers
before calculating the ratio between spaces and other characters.

For example message like this:

Return-path: <r0...@domainX.com>
Envelope-to: r00t@domainY.com
Delivery-date: Thu, 28 Jan 2010 11:05:10 +0300
Received: from [A1.B1.C1.D1] (helo=mail.domainX.com)
    by domainY.com with esmtps (TLSv1:AES256-SHA:256)
    (Exim 4.69)
    (envelope-from <r0...@domainX.com>)
    id 1NaPNE-0002HI-4W
    for r00t@domainY.com; Thu, 28 Jan 2010 11:05:10 +0300
Received: from host-1.domainZ.com ([A2.B2.C2.D2] helo=[192.168.60.39])
    by mail.domainX.com with esmtpsa (TLSv1:AES256-SHA:256)
    (Exim 4.71)
    (envelope-from <r0...@domainX.com>)
    id 1NaPMr-000563-Ah
    for r00t@domainY.com; Thu, 28 Jan 2010 08:04:45 +0000
Message-ID: <4B...@domainX.com>
Date: Thu, 28 Jan 2010 11:04:38 +0300
From: R00T <r0...@domainX.com>
User-Agent: Thunderbird 2.0.0.23 (X11/20090812)
MIME-Version: 1.0
To: r00t@domainY.com
Subject: TEST
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -1.4 (-)
X-Spam-Report: Spam detection software, running on the system
"s1010.hostingdns.tv", has
 identified this incoming email as possible spam.  The original message
 has been attached to this so you can view it (if it isn't spam) or label
 similar future email.  If you have any questions, see
 spam@transclaim.ru for details.

 Content preview:  TEST [...] 

 Content analysis details:   (-1.4 points, 5.0 required)

  pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -1.4 ALL_TRUSTED            Passed through trusted hosts only via SMTP
X-Spam-Level: -
X-Spam-Score: 3.0 (+++)
X-Spam-Report: Spam detection software, running on the system
"mail.domainY.com", has
    identified this incoming email as possible spam.  The original message
    has been attached to this so you can view it (if it isn't spam) or label
    similar future email.  If you have any questions, see
    postmaster@domainY.com for details.
    Content preview:  TEST [...] 
    Content analysis details:   (2.9 points, 3.0 required)
    pts rule name              description
    ---- ----------------------
--------------------------------------------------
    2.9 TVD_SPACE_RATIO        BODY: TVD_SPACE_RATIO
    0.1 RDNS_NONE              Delivered to trusted network by a host with no
rDNS
Subject: ***SPAM*** TEST
X-Spam-Level: +++
X-Spam-Status: score=3.0

TEST


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #5 from mouss <mo...@netoyen.net>  2009-05-26 12:57:43 PST ---
Created an attachment (id=4454)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4454)
a sample from the multimedia@FreeBSD.org list



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #18 from Warren Togami <wt...@redhat.com>  2009-07-17 07:51:37 PST ---
It appears that my TVD_SPACE_RATIO FP's are either:

* mail with only a single word or URL in the body.
* various legitimate personal mail written in Japanese.  Aside from being
written in either ISO-2022-JP or UTF-8, written Japanese typically lacks spaces
between words.  The same is true of Chinese.

Unfortunately I am unable to share any of the Japanese mail.  Most of it is in
the folders of a user of mine who agreed to hand classify using our standards.
I will ask her to choose a few samples to share, but it will be early next week.
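
Why text without word separators looks "vertical" to a space-ratio check can be
seen in a short Python sketch. It is an illustration under assumptions: the
sample sentences are made up, hits and space_ratio are hypothetical names, and
the sketch counts Unicode characters, whereas the real rule may well see
encoded bytes.

```python
def space_ratio(text: str) -> int:
    """ASCII spaces as a percentage of all other characters, truncated."""
    spaces = text.count(" ")
    nonspaces = len(text) - spaces
    return 100 if nonspaces == 0 else 100 * spaces // nonspaces


def hits(paragraph: str, min_pct: int = 0, max_pct: int = 10) -> bool:
    """True if the paragraph's ratio lies in [min_pct, max_pct)."""
    return len(paragraph) > 5 and min_pct <= space_ratio(paragraph) < max_pct


# Made-up sample sentences, not taken from any real mail.
japanese = "明日の会議は午前十時から始まります。よろしくお願いします。"
english = "The meeting starts at ten tomorrow morning. See you there."

print(space_ratio(japanese), hits(japanese))
print(space_ratio(english), hits(english))
```

The Japanese sentence contains no ASCII space at all, so its ratio is 0 and it
falls below the max, while the English sentence of similar content does not.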


[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #3 from mouss <mo...@netoyen.net>  2009-05-26 12:54:48 PST ---
Created an attachment (id=4452)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4452)
legit mail triggering TVD_SPACE_RATIO

a lot of messages posted to freebsd-ports-bugs@FreeBSD.org hit this rule. 



[Bug 6119] TVD_SPACE_RATIO false positives -- the FP Collection

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6119





--- Comment #15 from Justin Mason <jm...@jmason.org>  2009-05-28 13:37:17 PST ---
(In reply to comment #14)
> Since the Target Milestone is set to 3.3.0, the score issue most likely will be
> resolved magically by a full GA run. What I'd really like to sort out for 3.3.0
> is whether the eval really still works as originally intended, or whether it
> maybe got broken by evaluating paragraphs instead of lines and by the injected
> trailing space.

feel free to move it to 3.2.6 if you want to work on it for sa-update, though!
3.3.0 is just the easiest.

