Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/27 23:36:04 UTC

[Bug 4505] New: Score generation for SpamAssassin 3.1

http://bugzilla.spamassassin.org/show_bug.cgi?id=4505

           Summary: Score generation for SpamAssassin 3.1
           Product: Spamassassin
           Version: 3.1.0
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Score Generation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: henry@stern.ca


To tune the models this time, I am using a 10% random sample of all of the
corpus submissions.  All of these results have been generated using the same
parameters as I used with 3.0, except for set1.

False positives and negatives from the 10% sample to follow...

./model-statistics vm-set0-2.0-4.0-100/validate
False positives: mean=0.0753% std=0.0462
False negatives: mean=20.9334% std=7.3811
TCR (lambda=50): mean=2.7302 std=0.9718

./model-statistics vm-set1-2.0-4.0-100/validate
False positives: mean=0.0713% std=0.0435
False negatives: mean=5.9736% std=2.1137
TCR (lambda=50): mean=9.8396 std=3.6217

./model-statistics vm-set2-2.0-4.625-100/validate
False positives: mean=0.0847% std=0.0364
False negatives: mean=5.6917% std=2.0176
TCR (lambda=50): mean=9.7449 std=3.4877

./model-statistics vm-set3-2.0-5.0-100/validate
False positives: mean=0.0847% std=0.0527
False negatives: mean=2.9959% std=1.0621
TCR (lambda=50): mean=15.7957 std=6.3287



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4505] [review] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


duncf@debian.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|                            |1 more needed




------- Additional Comments From duncf@debian.org  2005-08-10 19:43 -------
Justin, can you elaborate on why rule_names.t was failing? I don't see why
FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not.

+1 on all 3



------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

[Bug 4505] [review] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


parkerm@pobox.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|1 more needed               |ready to commit




------- Additional Comments From parkerm@pobox.com  2005-08-10 22:08 -------
+1




Re: [Bug 4505] Score generation for SpamAssassin 3.1

Posted by Robert Menschel <Ro...@Menschel.net>.
> ------- Additional Comments From jm@jmason.org  2005-07-28 18:15 -------
> btw, more hits that look very iffy, from the freqs file:

>   0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
>   0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
>   0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI

> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
> RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?  (Bob, in
> particular, most seem to be coming from your corpus)

Summary:

Misclassified ham:  28
Bounce/outscatter of spam:  1
Possibly misclassified ham: 34
Constant Contact questionable: 3099 (ham and spam)
The remainder are IMO spam.

Note: In the following discussions where I say "flagged spam", I mean
fully encapsulated, with full SA report and score presented as the
primary email to the user.

> Misclassified ham:

From: newsletters@about.com (count: 7)
From: "American Express" <Am...@email.americanexpress.com>
      count: 10, multiple users fed to sa-learn, primarily because
      instead of being official notifications, statements, alerts,
      etc., the "spam" identified by users were marketing emails,
      "take a look at our special offers", "plan the perfect holiday",
      "upgrade to a card with premium service", etc. Only one of the
      sa-learned "spam" was what I'd consider a ham, though none of
      them are spam.
From: <su...@godaddy.com> (count: 2, 1 to each of 2 users)
From: PayPal <pa...@email.paypal.com> (count: 6)
From: Tikkun <Ti...@democracyinaction.org> (count: 1)
From: HeartCenterOnline <He...@heartcenteronline-mail.com> (count: 2)


> Possibly misclassified ham:

From: "CNET Help.com Online Courses  "
<CN...@newsletter.online.com>
Count: 9
User CR declared it to be spam via sa-learn. Probably old subscription.
Several others not fed to sa-learn, but flagged as spam by our system
(and not corrected by the users via sa-learn).
Willing to consider these ham.

From: "The Home Depot" <Ho...@homedepot.com>
Subject: Great Last-Minute Gifts for Dad
Count 4: Various users, flagged as spam by our system, not fed through
sa-learn. Looked like spam during validation. also have nine emails
from same source, 3 with low positive scores, six with negative
scores, also not fed through sa-learn.
Willing to consider these ham.

From: Godiva.com <go...@godiva.com>
Count 3: User CR declared it to be spam via sa-learn. Might be old
subscription.
Count 1: User SV, flagged as spam by our system, no sa-learn correction.
Note: my unverified corpus also has two more emails from same source,
not flagged as spam (low positive score), not fed to sa-learn.

From: "eBay" <eB...@reply3.ebay.com>  Count: 7
Subject: Preview eBay's Summer Sizzlers & Save Big!
Subject: B-52's Live, BBQ at Great America--register now for eBay Live and save!
Subject: feralcanning, check these amazing eBay deals--all under $10
User CR declared it to be spam via sa-learn. Maybe old subscription,
very likely not the type of email the user wanted from eBay.

From: "Movies Unlimited Video E-Flash" <ef...@moviesunlimitedeflash.com>
Count 3: User SA, system flagged as spam, no sa-learn, look like spam,
but all to single user. Could be ham.

From: "DVD Talk" <ne...@dvdtalk.com>  count: 2
To: mike@misosoup.com
Subject: DVD Talk: It's Back - The Huge DeepDiscountDVD.com Sale
User MM, system flagged as spam, no sa-learn, look like spam, all to
single user, count 2, many others not flagged as spam (some low
positive, some negative), none through sa-learn. Could be ham.

From: "Planet DVD Now" <sa...@planetdvdnow.com>  count: 3
To: ncoronado@prontotax.com
Subject: Planet DVD Now Insider News for Saturday June 18, 2005
User NP, system flagged as spam, no sa-learn, look like spam, all to
single user, count 3, many others not flagged as spam (some low
positive, some negative), none through sa-learn. Could be ham.

From: support@sexsearchcom.com  count: 3
Subject: SexSearch Shown Interest
User JB, flagged spam, no sa-learn. Only user receiving these emails.

> Constant Contact

Per earlier email, several other Constant Contact "newsletters"
flagged by our system as spam, variety of newsletters, variety of
users, spam classification not corrected by users, including technical
users who regularly and reliably sa-learn their misclassified emails.
Messages fed through sa-learn as spam by users:     17
Messages flagged as spam and not sa-learned as ham: 1586
Messages not flagged as spam:                       1496
IMO, if we discard the 1603 flagged as spam, we should also discard
the 1496 treated as ham.

> Sure looks like spam:

From: "Entertainment Update" <En...@mail85.subscribermail.com>
Subject: New Promotional Partner Opportunities
User CR declared it to be spam via sa-learn. Sure looks to me like spam.

From: The Motley Fool <Fo...@foolsubs.com>
Subject: Urgent Stock Buy/Sell Alert...from Motley Fool Stock Advisor
User CR declared it to be spam via sa-learn. Sure looks to me like spam.
Plus another copy flagged as spam by our system, same user, not fed to
sa-learn. Quite a few others, all look like spam.

From: "Entertainment Insider" <En...@mail85.subscribermail.com>
Subject: New Marketing Opportunities from The b EQUAL Company
Subject: New Promotional Opportunities Available from Nickelodeon
Subject: New Marketing Opportunities from Buena Vista Home Entertainment
User CR declared it to be spam via sa-learn. Sure looks to me like spam.
Count: 5

From: Rabbi Michael Lerner <ra...@tikkun.org>
Subject: Science and Spirit--a work group at the Network of Spiritual Progressives Founding Conferences
User RI declared it to be spam via sa-learn. Maybe old subscription,
very likely not the type of email the user wanted from this source.

From: "ArcaMax" <ez...@arcamax.com>
Subject: Congratulations - You Won
User NP declared it to be spam via sa-learn. Sure looks to me like spam.
Two copies, same recipient, different message ids
Third email, also user NP, no sa-learn, flagged as spam by our system,
sure looks like spam to me.
Other emails, various users, no sa-learn, flagged as spam by our
system, look like spam to me.

From: South Beach Diet Online <pr...@southbeachdiet.com>
Subject: why this diet WORKS!
User AM, no sa-learn, flagged as spam by our system.
> You are receiving this message because you subscribed to or visited
> a Waterfront Media newsletter or product."
Visited a newsletter or product = looks like spam to me.

From: DGI Line - asi/50910 <pr...@promotioncorner.com>
Reply-To: promoflash@promotioncorner.com
To: jan@award-source.com
Subject: 2005 Magnetic Football Schedules!  All Pro Teams Available
User JA, no sa-learn, flagged as spam by our system, roving constant
contact, contents look like spam to me.

From: "NewsMax.com" <cu...@reply.newsmax.com>
Subject: Ken Blackwell and New Republicans: Inside Story
User GI, no sa-learn, flagged as spam by our system, only one email in
corpus, including unclassified. If "newsmax.com" were a real service,
I'd expect repeated emails. Therefore I believe this to be spam.

From: Health Insurance Solutions <He...@focalex2.com>
Subject: Health and happiness go hand in hand.
User JC, system flagged as spam, no sa-learn, five separate emails,
all look like spam (including no MID from sender), all to single user,
an insurance agent. Could be ham. But...
From: Medical Insurance <Me...@focalex2.com>
Subject: Take care with medical insurance.
From: US Immigration Help <US...@focalex2.com>
Subject: Make the dream of citizenship a reality.
User JC, system flagged as spam, no sa-learn, multiple emails,
all look like spam (including no MID from sender), all to single user,
an insurance agent. Content very much so aimed at consumer, not agent,
strongly suggesting to me that all email from @focalex2.com is indeed
spam. Then ...
From: Posters And Wall Art <Po...@focalex2.com>
Subject: What your walls want to wear.
Same user (insurance agent), same source, nothing at all to do with
insurance or anything similar to any other email received by this
user. Other spam samples abound in more recent email.

From: "SmartBargains" <Sm...@deals.smartbargains.com>
Reply-To: "SmartBargains" <Sm...@deals.smartbargains.com>
To: srose@cencalins.com
Subject: 320TC Sheet Set, Duvet & More Just $29.95
User SC, system flagged as spam, no sa-learn, all look like spam.
User DT: same as user SC.
Emails do refer to users by a first name which matches first letter of
email address.
> You are receiving this email because you subscribed to it through
> SmartBargains.com or one of our partners.

From: AIU Online <ai...@aiuonline-update.com>
Subject: Nights. Weekends. We're here when it's convenient for YOU!
Consistent spam, repeated sa-learn as spam, 2 users, plus one
unclassified to third user. Confident this is spam.

From: "International Living" <we...@internationalliving.com>
To: jim@cudney.com
Subject: IL Postcards - Tax Breaks in the Cloud Forest
User JC, many emails flagged spam, many emails not flagged, no
sa-learn. May or may not be spam. Certainly looks like scam.

From: "Martin D. Weiss, Ph.D." <al...@weissinc.com>
Subject: A Personal Invitation from Martin Weiss
User JC, all emails flagged spam, no sa-learn, emails certainly do
look like spam/scam. Sent to only this user.

From: Hersheys Kisses <ki...@prewards.com>
Subject: Complimentary 10 lbs of Hershey's Chocolate
User BQ, clear spam, even in SURBL blacklist.

From: "TopButton" <vi...@TopButton.com>
To: nysale@dvorak.org
Subject: TOP BUTTON VIP - Prada Price Cuts: 4-Days Only
User ND, among the most technically oriented and skilled of our users,
email flagged as spam, no sa-learn, only email from this source in the
entire corpus, looks unquestionably spam.

From: eDiets Extra <ex...@ediets.com>
Subject: Miami Mediterranean Diet: It's Hot!
Users ST and KG, several emails flagged spam, many emails not flagged,
no sa-learn. May or may not be spam. Certainly looks like spam.

Bob Menschel




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-28 18:15 -------
btw, more hits that look very iffy, from the freqs file:

  0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
  0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
  0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI

that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?  (Bob, in
particular, most seem to be coming from your corpus)
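
For reference, a minimal Python sketch of how such freqs lines could be read.
It assumes the columns are OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE and NAME
(the usual hit-frequencies layout; the lines above do not label them), that
S/O is the spam hit rate divided by the sum of the spam and ham hit rates, and
that the percentages are relative to the 1483066 spam messages mentioned for
the scoreset 3 mass-check logs.  The helper names are made up for illustration.

```python
from typing import NamedTuple


class FreqRow(NamedTuple):
    overall_pct: float  # assumed: % of all messages hit
    spam_pct: float     # assumed: % of spam messages hit
    ham_pct: float      # assumed: % of ham messages hit
    s_o: float          # assumed: spam/overall ratio as printed
    rank: float
    score: float
    name: str


def parse_freq_line(line: str) -> FreqRow:
    """Split one whitespace-separated freqs line into typed fields."""
    *numbers, name = line.split()
    if len(numbers) != 6:
        raise ValueError(f"expected 6 numeric columns, got {len(numbers)}")
    return FreqRow(*map(float, numbers), name)


def spam_over_overall(row: FreqRow) -> float:
    """Recompute S/O from the spam and ham hit percentages."""
    return row.spam_pct / (row.spam_pct + row.ham_pct)


def estimated_hits(pct: float, total_messages: int) -> float:
    """Convert a hit percentage back into an approximate message count."""
    return pct / 100.0 * total_messages


if __name__ == "__main__":
    row = parse_freq_line(
        "  0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED"
    )
    print(row.name, round(spam_over_overall(row), 3))
    print("approx. spam hits:", round(estimated_hits(row.spam_pct, 1483066)))
```

Because the printed percentages are rounded, the recomputed count is only
approximate; it lands close to the 809 spam hits quoted above.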




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-28 18:04 -------
Created an attachment (id=3044)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3044&action=view)
freqs for scoreset 3, all logs

fyi -- here's the freqs data from 3.1.0's mass-check logs, scoreset 3.

I didn't clear up the misclassifications reporting since the perceptron run,
fwiw; this is just using the rsync'd logs.  so far, though, the FP/FNs reported
are tiny compared to the number of mass-checked messages (1483066 spam, 743761
ham).




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-02 16:26 -------
regarding the Bob's-corpus issue.   I've been pondering this a bit, and I think
we have to leave it out of the rescore run.

Fundamentally, I don't trust the user population involved :(  I think your
users are using "learn as spam" to keep stuff that isn't *strictly* UBE out of
their mail folders; by using those logs, we'd generate score-sets to consider
spam to be "stuff your users don't want" rather than "unsolicited bulk email",
which is what we have to aim towards.

We used to have a spam definition, namely "spam == UBE", up somewhere related
to corpus policy, but I can't find it now.   But in my opinion that still
applies ;)

(to be honest, I'm not sure there's any good way to use someone else's email in
a rescoring run, since I've often wound up saying "yes, I subscribed to that
horrible spammy-looking newsletter that's sending with a misleading HELO
string", even for my own mail.  and you should see Rod's corpus! ;)

--j.





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3065 is|0                           |1
           obsolete|                            |




------- Additional Comments From jm@jmason.org  2005-08-09 21:46 -------
Created an attachment (id=3066)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3066&action=view)
redo of 3065

ok, this one:
- passes t/meta.t
- zeroes rules where -0.1 < score < 0.1
- is otherwise identical.

I haven't redone the STATISTICS files, though. ;)
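
The zeroing step can be sketched in Python as follows.  This is only an
illustration, not the script used for the patch: it assumes score lines of the
form "score NAME v0 v1 v2 v3" with purely numeric values, and the rule name in
the example is hypothetical.

```python
def zero_small_scores(line: str, threshold: float = 0.1) -> str:
    """Replace every score with -threshold < score < threshold by 0.

    Lines that are not "score NAME value..." lines are returned unchanged.
    """
    parts = line.split()
    if len(parts) < 3 or parts[0] != "score":
        return line
    head, values = parts[:2], parts[2:]
    # Strict comparison: a score of exactly 0.1 or -0.1 is kept as it is.
    cleaned = ["0" if abs(float(v)) < threshold else v for v in values]
    return " ".join(head + cleaned)


if __name__ == "__main__":
    # HYPOTHETICAL_RULE is a made-up rule name used only for this example.
    print(zero_small_scores("score HYPOTHETICAL_RULE 0.05 -0.099 0.1 1.5"))
```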




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-07-27 15:19 -------
The misses can be found on the rsync server in /corpus/scoregen-3.1/falses/

I wanted to put them on BZ, but the file is too big.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-07-29 14:53 -------
I did a run with the full 2M corpus.  Here are the results:

vm-set0-2.0-4.0-100
False positives: mean=0.0625% std=0.0263
False negatives: mean=21.8408% std=7.6947
TCR (lambda=50): mean=2.6218 std=0.9242

vm-set1-2.0-4.0-100
False positives: mean=0.0682% std=0.0263
False negatives: mean=6.1945% std=2.1798
TCR (lambda=50): mean=9.5497 std=3.3674

vm-set2-2.0-4.625-100
False positives: mean=0.0846% std=0.0325
False negatives: mean=7.9603% std=2.8295
TCR (lambda=50): mean=7.3340 std=2.5958

vm-set3-2.0-5.0-100
False positives: mean=0.0822% std=0.0318
False negatives: mean=3.0710% std=1.0898
TCR (lambda=50): mean=15.2954 std=5.4556




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-08-09 12:10 -------
Changing the Bayes scores didn't have an impact on accuracy with newly-generated
scores.  That doesn't mean that changing the Bayes scores in the previously
generated score set has no impact on accuracy (we know that it does).

Do you really want me to generate the scores again?  It's a real ballache but
I'll do it.

Samples: vm-set1-2.0-4.0-100-nobob vm-set1-2.0-4.0-100-nobob-ib
False positives:
        Sample 1: mean=0.0554% std=0.0229
        Sample 2: mean=0.0595% std=0.0252
        Statistically significantly different with confidence 99.2161%
        Estimated difference: -0.0041% +/- 0.0117

False negatives:
        Sample 1: mean=3.3473% std=1.1779
        Sample 2: mean=3.3299% std=1.1745
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 0.0174% +/- 0.1339

TCR (lambda=50):
        Sample 1: mean=17.2267 std=6.1150
        Sample 2: mean=16.9662 std=6.0300
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 0.2605 +/- 1.0179

Samples: vm-set3-2.0-5.0-100-nobob vm-set3-2.0-5.0-100-nobob-ib
False positives:
        Sample 1: mean=0.0546% std=0.0282
        Sample 2: mean=0.0575% std=0.0241
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: -0.0028% +/- 0.0651

False negatives:
        Sample 1: mean=1.0845% std=0.5179
        Sample 2: mean=1.2911% std=0.4657
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: -0.2066% +/- 0.8138

TCR (lambda=50):
        Sample 1: mean=37.6074 std=15.3585
        Sample 2: mean=31.9635 std=11.8543
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 5.6439 +/- 23.5426






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-07-28 09:57 -------
I'd rather that we didn't clean up the logs this way because:

1) You've only removed errors from 10% of the logs.
2) You haven't removed the errors that both you and SA have made.

I'm running a set of cross-validations on the full set now.  If you really want
to remove only the instances where the human was incorrect and the classifier
was correct and not the instances where both the human and the classifier are
incorrect, I will upload the errors to the rsync server when it's finished.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From spamassassin@dostech.ca  2005-08-06 01:36 -------
I just noticed that the proposed 3.1 BAYES_* scores in scoreset 2 are identical
to the 3.0 ones.

So... manually tweaked scores for 3.0 should work just as good with 3.1.  I'm +1
on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2
scores copied to scoreset 3).  I really think BAYES_99 should score at least 4.0.

I'm not exactly sure which of Loren's scores Justin is referring to, but I think
3.5 is too low for BAYES_99.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-29 19:41 -------
I hacked together something to make ROC curves... take a look.

current SVN trunk:
http://taint.org/xfer/2005/roc_curves_pre_perceptron.png

with the scores in patch 3045:
http://taint.org/xfer/2005/roc_curves_with_3045.png




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From parkerm@pobox.com  2005-07-27 19:56 -------
I let Henry know, but for the record, I looked through all of mine and they are
all good to go.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From felicity@apache.org  2005-07-27 19:54 -------
Subject: Re:  Score generation for SpamAssassin 3.1

On Wed, Jul 27, 2005 at 06:39:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Please download and verify that any mails in the FP set that are coming from
> your corpus, are indeed valid ham; and ditto for the FN set being spam.

Ok, checked over the set3 results.

FPs: all valid ham.
FNs: all valid spam.

In full disclosure, several of the spams could be considered
"questionable", namely HGTV newsletters which also include DIY
newsletters.  I was originally receiving them to a hamtrap, but then I
started receiving things I didn't ask for, and then couldn't unsubscribe,
so they got switched to spam instead.

The rest are a various set of things, mostly stock spams, phishing,
several of those German spams from earlier in the year, national lottery
spams, etc.






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-10 13:39 -------
Created an attachment (id=3068)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3068&action=view)
fix for test failures caused by 3066

this is an adjunct to 3066; unfortunately, without this patch, make test
produces lots of failures.

it's a set of fixes to the test suite, fixing more of the tests to use their
own rules, instead of relying on the distribution-default ruleset; this patch
adds a new test-suite-specific rules file, so the test suite is more
independent of the basic ruleset.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-03 19:04 -------
anyway, back to the score generation thing, a few items:


1. I'm -1 on using those scores. They look great all-round, *except* for the
Bayes scores:

 56.044  84.1316   0.0375    1.000   0.84    1.89  BAYES_99
  1.716   2.5715   0.0099    0.996   0.83    2.06  BAYES_95
  1.983   2.9654   0.0251    0.992   0.76    2.09  BAYES_80
  1.685   2.5064   0.0463    0.982   0.68    0.37  BAYES_60
 31.996   0.3606  95.0772    0.004   0.60   -2.60  BAYES_00
  4.503   5.9619   1.5927    0.789   0.47    0.00  BAYES_50
  0.311   0.0880   0.7556    0.104   0.36   -0.41  BAYES_05
  0.377   0.1622   0.8048    0.168   0.32   -1.95  BAYES_20
  0.401   0.2655   0.6706    0.284   0.27   -1.10  BAYES_40

(scoreset 3 freqs output.)   note that none of them was permitted above 2
points by the perceptron; those scores have the odd flattening for
BAYES_95/99 we had to fix in 3.0.3 in r165033; and there seems to be
unanimous support on the record for fixing these.

(ok, I'm being a little disingenuous on the last point, as I think someone,
either Daniel or Henry, was ok with letting them float, but they made the
comment on a transitory medium like IRC or IM so it doesn't count. ;)

So I suggest we set them to the static scores and move them out of the mutable
section, as done in the attached patch, then get Henry to rerun
the perceptron.   for ease of review, those static scores are:

score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
score BAYES_20 0.0001 0.0001 -0.740 -0.740
score BAYES_40 0.0001 0.0001 -0.185 -0.185
score BAYES_50 0.0001 0.0001 0.001 0.001
score BAYES_60 0.0001 0.0001 1.0 1.0
score BAYES_80 0.0001 0.0001 2.0 2.0
score BAYES_95 0.0001 0.0001 3.0 3.0
score BAYES_99 0.0001 0.0001 3.5 3.5

they're a mix of what the perceptron said in that last run, what was used in
3.0.3, and some smoothing (to avoid the FAQs again).
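
As an illustration of the "flattening" problem, here is a small Python sketch
that checks whether the BAYES_* scores rise from bucket to bucket.  It assumes
the four values on each score line are the scores for scoresets 0 to 3, in that
order; the perceptron values are the SCORE column of the scoreset 3 freqs output
above, and the function names are made up for this example.

```python
BUCKETS = ["BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
           "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99"]

# Proposed static scores, copied from the list above.
STATIC_SCORES = """\
score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
score BAYES_20 0.0001 0.0001 -0.740 -0.740
score BAYES_40 0.0001 0.0001 -0.185 -0.185
score BAYES_50 0.0001 0.0001 0.001 0.001
score BAYES_60 0.0001 0.0001 1.0 1.0
score BAYES_80 0.0001 0.0001 2.0 2.0
score BAYES_95 0.0001 0.0001 3.0 3.0
score BAYES_99 0.0001 0.0001 3.5 3.5
"""

# Perceptron-generated scoreset 3 scores, from the freqs output above.
PERCEPTRON_SET3 = {
    "BAYES_00": -2.60, "BAYES_05": -0.41, "BAYES_20": -1.95,
    "BAYES_40": -1.10, "BAYES_50": 0.00, "BAYES_60": 0.37,
    "BAYES_80": 2.09, "BAYES_95": 2.06, "BAYES_99": 1.89,
}


def parse_scores(text: str, scoreset: int) -> dict:
    """Return {rule name: score} for one scoreset from four-value score lines."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "score":
            scores[parts[1]] = float(parts[2 + scoreset])
    return scores


def is_non_decreasing(scores: dict) -> bool:
    """True if the score never drops when moving to a higher Bayes bucket."""
    values = [scores[name] for name in BUCKETS]
    return all(a <= b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    print("static, set 3:", is_non_decreasing(parse_scores(STATIC_SCORES, 3)))
    print("perceptron, set 3:", is_non_decreasing(PERCEPTRON_SET3))
```

With these inputs the static scores pass the check for scoresets 2 and 3, while
the perceptron scores do not (BAYES_99 ends up below BAYES_95 and BAYES_80).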


Henry -- any chance you can gzip up the validation set after you run the
perceptron, and put them somewhere?   There's a whole batch of stuff that needs
to be done that needs those.  also, we need to get the statistics in.   I've
updated http://wiki.apache.org/spamassassin/RescoreMassCheck with what I think
needs to be done (steps 5 onwards).

Probably not worth doing those until we vote on the patch / figure out
what to do with the BAYES scores, though.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-07 16:27 -------
Henry: 3051 now has 3 +1s, and can be committed.  It moves the BAYES scores into
an immutable block.  so if you want to give this a go, go ahead and patch that
and check it in, then rerun the perceptron; alternatively, I'll check it in
later if you haven't beaten me to it, and you can rerun perceptron after that.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-06 15:56 -------
OK, I got hold of the logs from Henry, and measured some BAYES scores
against the validation set:

base results from comment 28, gen-set3-2.0-5.0-100-nobob:
# Correctly non-spam:  53070  99.96%
# Correctly spam:     121906  98.49%
# False positives:        21  0.04%
# False negatives:      1872  1.51%
# TCR(l=50): 42.360712  SpamRecall: 98.488%  SpamPrec: 99.983%

copying values from set 2 for set 3:
# Correctly non-spam:  53064  99.95%
# Correctly spam:     122453  98.93%
# False positives:        27  0.05%
# False negatives:      1325  1.07%
# TCR(l=50): 46.272150  SpamRecall: 98.930%  SpamPrec: 99.978%

comment 14:
# Correctly non-spam:  53014  99.85%
# Correctly spam:     123093  99.45%
# False positives:        77  0.15%
# False negatives:       685  0.55%
# TCR(l=50): 27.293936  SpamRecall: 99.447%  SpamPrec: 99.937%

comment 42 (the patch in attachment 3051):
# Correctly non-spam:  53068  99.96%
# Correctly spam:     122509  98.97%
# False positives:        23  0.04%
# False negatives:      1269  1.03%
# TCR(l=50): 51.169078  SpamRecall: 98.975%  SpamPrec: 99.981%

I think 3051 has the best scores: fewer FNs than the base results, just 2 more
FPs, and sane scores.   I'd suggest we just vote on that patch.
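
The summary figures in these reports can be reproduced from the four counts.
The Python sketch below assumes the usual total cost ratio definition,
TCR = n_spam / (lambda * FP + FN), with recall and precision computed for the
spam class; the function name is made up, and this is not the fp-fn-statistics
code itself.

```python
def summarize(spam_ok: int, fp: int, fn: int, lam: float = 50.0):
    """Return (TCR, spam recall, spam precision) from classification counts.

    spam_ok: spam correctly flagged; fp: ham flagged as spam;
    fn: spam missed; lam: cost of one false positive relative to one miss.
    """
    n_spam = spam_ok + fn
    tcr = n_spam / (lam * fp + fn)
    recall = spam_ok / n_spam
    precision = spam_ok / (spam_ok + fp)
    return tcr, recall, precision


if __name__ == "__main__":
    runs = {
        "base (comment 28)": (121906, 21, 1872),
        "attachment 3051": (122509, 23, 1269),
    }
    for label, (spam_ok, fp, fn) in runs.items():
        tcr, recall, precision = summarize(spam_ok, fp, fn)
        print(f"{label}: TCR={tcr:.6f} "
              f"recall={recall:.3%} precision={precision:.3%}")
```

Under that definition the counts above give TCR values that match the reported
ones (about 42.36 for the base run and 51.17 for 3051), which shows how heavily
each extra false positive is weighted at lambda=50.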

If you want to try other values btw -- the logs are in the zone.  do this:

  cd svncheckout/masses
  rm ham.log spam.log
  ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/NSBASE/ham-test.log ham.log
  ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/SPBASE/spam-test.log spam.log
  vi ../rules/50_scores.cf
  ./fp-fn-statistics --scoreset=3





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-06 11:02 -------
'So... manually tweaked scores for 3.0 should work just as good with 3.1.  I'm +1
on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2
scores copied to scoreset 3).  I really think BAYES_99 should score at least 4.0.'

OK, I'm fine with the comment 41 scores, and I agree BAYES_99 should be >= 4.0.
+1.

care to make a patch?




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs Henry                 |




------- Additional Comments From jm@jmason.org  2005-08-09 12:12 -------
'Do you really want me to generate the scores again?  It's a real ballache but
I'll do it.'

no, no need.  thanks for checking btw!




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|                            |needs Henry




------- Additional Comments From jm@jmason.org  2005-08-05 13:44 -------
hellooooo! anyone out there? especially Henry, you're on the critical path here
in a big way. This bug is the 3.1.0 blocker.  Once this is done we can release
3.1.0.  As such it's pretty important! 

IMMEDIATELY REQUIRED:

- Henry: gzip up the validation logs set and put them somewhere.  This
  gets you off the critical path for 3.1.0, at least temporarily, since
  we can try out new bayes scores and figure out if a new perceptron
  will need to be run, or if we can just bump the scores manually and
  use the patch you already posted.   Without the validation set,
  we can't get an accurate idea afaik.

- ALL DEVS: decide correct scores for BAYES*.    this requires comments.
  please comment.

- ALL DEVS: if my patch of proposed BAYES* scores meets with your approval
  (which I'd say it probably won't seeing as everyone has their favourites),
  vote +1.  Otherwise create a patch of your own we can vote on. I think
  DOS' and Loren's suggested scores both look ok.

DOWN THE ROAD A BIT:

- Henry: (possibly) rerun the perceptron if the validation logs set
  indicates that it's required.

- ALL DEVS: once there's a new patch with all scores, vote on it so
  it can be applied.






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From felicity@apache.org  2005-07-28 19:19 -------
Subject: Re:  Score generation for SpamAssassin 3.1

On Thu, Jul 28, 2005 at 06:15:52PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
> RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?

My hits are all valid, btw.






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From rOD-spamassassin@arsecandle.org  2005-07-28 14:55 -------
My only misclassify:
/Users/rod/spam/Maildir/.spam.2004-12/cur/1103575966.15119_0.blazing.arsecandle.org,S=18955:2,S
is really ham.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-29 16:57 -------
oops, missed that.

however, I don't think Bob was talking about the BSP issue in that mail...

Sidney -- I think you're confusing Constant Contact with Return Path -- Return
Path are now partners in the BSP, http://www.returnpath.net/, but afaik
Constant Contact are a different company.  I don't think that's it (although it
may be some of the hits).





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From Bob@Menschel.net  2005-07-31 15:11 -------
I personally would prefer to avoid fixing any Bayes scores so they couldn't
float, but I feel equally strongly that BAYES_99 should score higher than the
others. BAYES_00 is problematic when a Bayes database gets poisoned, but
BAYES_99 generally doesn't have that problem. 

Option 1: Allow all Bayes scores to float, but add code which forces BAYES_99 to
be at least 10% higher than the max score of all other Bayes scores (at least
BAYES_95).

Option 2: Allow all Bayes scores to float, but give BAYES_99 a floor of either
3.5 or 4.0 -- it can float higher if the Perceptron feels it should, but no lower. 

In SARE we sometimes run into a family of rules like Bayes, something like
__RULE_1 -- spam sign # 1
__RULE_2 -- spam sign # 2
__RULE_3 -- spam sign # 3
meta RULE_1 -- rule 1 but not 2 or 3
meta RULE_2 -- rule 2 but not 1 or 3
meta RULE_3 -- rule 3 but not 1 or 2
meta RULE_4 -- rules 1 and 2 but not 3
meta RULE_5 -- rules 1 and 3 but not 2
meta RULE_6 -- rules 2 and 3 but not 1
meta RULE_7 -- rules 1, 2, and 3
The meta rules 1-3 are scored based on their solo hits (the hits of their
__feeder rules), using our standard SARE algorithms.
Assuming that meta rules 4-6 hit fewer ham than 1-3, we score them higher than
1-3, even if their total spam hits are lower (because of the increased
requirements). 
Likewise, meta rule 7 will be scored highest of this family, because it's 
"safest" of the seven rules. 

Would it be worthwhile to open a new Bugzilla entry for a 3.2 enhancement to
implement some kind of "this rule scores better than that rule if its S/O is at
least as good" linkage? 
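
The two options above could be sketched as a post-processing step on generated scores. This is only an illustration: the function names are hypothetical, nothing like this exists in the masses/ tools as far as this thread shows, and the example values are the scoreset 3 scores for BAYES_50-99 quoted elsewhere in this bug.

```python
# Hypothetical post-processing of generated Bayes scores (not part of masses/).

def apply_option_1(scores, margin=0.10):
    """Option 1: BAYES_99 must be at least `margin` above the highest other Bayes score."""
    adjusted = dict(scores)
    highest_other = max(v for k, v in scores.items() if k != "BAYES_99")
    required = highest_other * (1.0 + margin)
    adjusted["BAYES_99"] = max(scores["BAYES_99"], required)
    return adjusted


def apply_option_2(scores, floor=4.0):
    """Option 2: BAYES_99 may float higher than `floor`, but never lower."""
    adjusted = dict(scores)
    adjusted["BAYES_99"] = max(scores["BAYES_99"], floor)
    return adjusted


# Scoreset 3 values as posted in this bug.
generated = {
    "BAYES_50": 0.001,
    "BAYES_60": 0.372,
    "BAYES_80": 2.087,
    "BAYES_95": 2.063,
    "BAYES_99": 1.886,
}

print(apply_option_1(generated)["BAYES_99"])
print(apply_option_2(generated)["BAYES_99"])
```

Neither function lowers a BAYES_99 score that is already above the constraint, which matches the "it can float higher" wording of Option 2.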




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From duncf@debian.org  2005-08-06 22:30 -------
+1 on 3051

It would probably be more valid if we set the bayes score a little higher and
re-ran the perceptron, that way we could get scores over 4 for BAYES_99 without
so many FPs.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From lwilton@earthlink.net  2005-07-29 20:29 -------
Subject: Re:  Score generation for SpamAssassin 3.1

+score BAYES_50 0 0 0.845 0.001 # n=1
+score BAYES_60 0 0 2.312 0.372 # n=1
+score BAYES_80 0 0 2.775 2.087 # n=1
+score BAYES_95 0 0 3.023 2.063 # n=1
+score BAYES_99 0 0 2.960 1.886 # n=1

I think the score for BAYES_99 should be hand tweaked, regardless of what the score generator said.
This was big grief for most people on 3.0 - 3.0.3, and I'd just as soon not see it take until 3.1.3 to apply the same hack again.

          Loren






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


parkerm@pobox.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dev@spamassassin.apache.org






------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-03 19:05 -------
Created an attachment (id=3051)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3051&action=view)
bayes scores





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From lwilton@earthlink.net  2005-08-03 19:49 -------
Subject: Re:  Score generation for SpamAssassin 3.1

FWIW, the data from scoreset 3 more closely supports using the equation (bayes_group-50)/(50/3.5) to calculate the score.  This is quite close to Justin's values above 50, but departs considerably at lower Bayes values:

Group	Set 3	Norm 3.5	Justin 2	Justin 3
0	-2.600	-3.500	-2.312	-2.599
5	-0.410	-3.150	-1.110	-1.110
20	-1.950	-2.100	-0.740	-0.740
40	-1.100	-0.700	-0.185	-0.185
50	0.000	0.000	0.001	0.001
60	0.370	0.700	1.000	1.000
80	2.090	2.100	2.000	2.000
95	2.060	3.150	3.000	3.000
99	1.890	3.430	3.500	3.500

The "Norm 3.5" group matching the above equation is very close to the Perceptron scores for Bayes_20 to Bayes_80.  The Perceptron score for Bayes_05 is just plain wonky, and of course the scores flatten completely at Bayes_80.

Running a simple linear solution to approximate the bayes-20 to bayes-80 scores with a straight line produces a slightly lower value for the constant (3.5) above: 3.3875.  This of course produces slightly less aggressive scores on the top and bottom ends:

Group	Set 3	Norm 3.3875
0	-2.600	-3.388
5	-0.410	-3.049
20	-1.950	-2.033
40	-1.100	-0.678
50	0.000	0.000
60	0.370	0.678
80	2.090	2.033
95	2.060	3.049
99	1.890	3.320
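
For anyone who wants to try other constants, the equation (bayes_group-50)/(50/3.5) can be evaluated with a few lines of Python. This is a minimal sketch; the function name and the `scale` parameter are made up for illustration, and only the equation and the two constants come from the tables above.

```python
# Evaluate the linear Bayes score equation for each BAYES_nn group.
BAYES_GROUPS = [0, 5, 20, 40, 50, 60, 80, 95, 99]


def linear_bayes_score(group, scale=3.5):
    """(bayes_group - 50) / (50 / scale): zero at BAYES_50, +/-scale at 100 and 0."""
    return (group - 50) / (50.0 / scale)


for group in BAYES_GROUPS:
    norm_35 = linear_bayes_score(group)
    norm_33875 = linear_bayes_score(group, scale=3.3875)
    print(f"{group:>2}\t{norm_35:+.3f}\t{norm_33875:+.3f}")
```

Changing `scale` is the only knob: it sets how far the score would reach at a Bayes probability of 0 or 100.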






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-09 21:14 -------
ok, working on the meta.t failures and on zeroing the scores that fall within
-0.1 < score < 0.1.

question: has anyone used 'rewrite-cf-with-new-scores' recently?  can it
successfully rewrite these scores in place?

# URIDNSBL
ifplugin Mail::SpamAssassin::Plugin::URIDNSBL
# <gen:mutable>
score URIBL_AB_SURBL 0 3.306 0 3.812
score URIBL_JP_SURBL 0 3.360 0 4.087
score URIBL_OB_SURBL 0 2.617 0 3.008
score URIBL_PH_SURBL 0 2.240 0 2.800
score URIBL_SBL 0 1.094 0 1.639
score URIBL_SC_SURBL 0 3.600 0 4.498
score URIBL_WS_SURBL 0 1.533 0 2.140
# </gen:mutable>
endif # Mail::SpamAssassin::Plugin::URIDNSBL

what happens for me is that they get shoved into the main <gen:mutable> section,
and lose their "ifplugin" scope.  that's obviously bad news, as it means that
manual hand-editing is required to fix it.

is there a working script that avoids that problem?
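
One way to avoid the scope problem is to rewrite only the values on existing "score" lines and leave every other line where it is. The following is a rough sketch of that idea, not the behaviour of 'rewrite-cf-with-new-scores'; the function name and the shape of `new_scores` (rule name mapped to a list of score strings) are assumptions made for the example.

```python
import re

# Matches "score RULE_NAME <rest of line>", keeping any leading indentation.
SCORE_LINE = re.compile(r"^(\s*)score\s+(\S+)\s+(.*)$")


def rewrite_scores_in_place(cf_text, new_scores):
    """Replace values on existing score lines; ifplugin/endif and
    <gen:mutable> marker lines are passed through untouched."""
    out = []
    for line in cf_text.splitlines():
        m = SCORE_LINE.match(line)
        if m and m.group(2) in new_scores:
            indent, rule, rest = m.groups()
            # Keep any trailing comment such as "# n=1".
            comment = ""
            if "#" in rest:
                comment = " #" + rest.split("#", 1)[1]
            line = f"{indent}score {rule} {' '.join(new_scores[rule])}{comment}"
        out.append(line)
    return "\n".join(out) + "\n"
```

Rules that appear in `new_scores` but have no existing score line are simply ignored here; a real tool would have to decide where to put them.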




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-07-28 03:22 -------
I'm not too concerned about a few mis-labeled entries.  All that will happen
from those is that our numbers will look a bit off.  Unless anyone has
objections, I'm going to use the corpus as is and will generate the scores.  The
learning algorithm is stable enough to work around a bit of noise.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From Bob@Menschel.net  2005-07-31 18:42 -------
SM> It's tricky getting a good corpus: ...

In addition to your reasons, a good corpus for local use (it's spam here, and
always spam here) may not be good for global use (it's not spam to users on that
other system over there). And to expand on your
SM> There are people who [sa-learn as spam] not because they are clueless, but
if they don't recognize that something comes from a subscription or just aren't
sure, ...
There are also sources that confound matters -- a user can sign up with them for
one brand, and receive emails from a corporate parent with a different domain name.

SM> And there's Constant Contact who may have found a way around what at first
glance appears to be a good defense against spam.

SM> ... if Constant Contact really is doing that, they must be counting on
low numbers of complaints. 

Apparently they are, based on the large number of cc.com emails here that
qualify for the BSP rules. 

SM> That link I posted to Ironport's site listed the Bonded Sender fees as of
two years ago. It makes it risky for a single customer to spam. But I can see
how Constant Contact could have a business model based on getting paid by a mix
of spammers and hammers. The Bonded Sender fines are based on number of
complaints per million mails. If you want to nail them, get aggressive about
reporting the confirmed RCVD_IN_BSP_TRUSTED spam. ...

My family gets a lot more ham than spam from cc.com, and so in the past on those
rare occasions when we've gotten cc.com spam I've gone directly to them, with
satisfactory results. Given what I'm seeing now in this corpus, I'll send in the
formal complaints to BSP/Ironport, to increase cc.com's incentive to police
their customers. 

SM> So how do you have a clean corpus when it could contain edge cases that are
classified wrong? ...

Or, IMO more correctly, a valid and representative corpus used for scoring
/should/ have edge cases that may or may not be classified wrong -- there's no
other way for a major ISP who can't know what their users did or didn't
subscribe for, to manage their spam. It's important to classify them as
accurately as humanly possible, but for SA to be optimally useful it needs to be
able to make judgments about the edge cases as well, and it can only do that if
we take the risk and include them in our corpus. 

SM> What is the "correct" score for such mail? If the only difference between a
piece of spam and a piece of ham is whether the recipient subscribed to it, how
do you call either one an FP or an FN for the purpose of the rule scoring
program? I don't have answers to that.

First pass suggestion:  Aim to get these "edge" emails into the 2.0-4.0 score
range, so that network tests and hopefully Bayes can push them over 5.0 or under
0.0 as appropriate for the user/site. 






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3066 is|0                           |1
           obsolete|                            |




------- Additional Comments From jm@jmason.org  2005-08-10 17:58 -------
Created an attachment (id=3069)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3069&action=view)
redo of 3066

well isn't this fun.  it turns out that rule_names.t introduces more
unpredictability in our test suite, and causes *occasional* 'make test'
failures.

FUZZY_VALIUM in rules/25_replace.cf was therefore causing make test failures,
due to its name; this version of the rules patch includes the new scores, the
new stats, and renames that rule to "FUZZY_VLIUM" to avoid this test failure.

the following patch is a fix for t/rule_names.t that removes this
unpredictability.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-09 11:31 -------
hmm, nix that patch.   I've just realised the STATISTICS files don't contain the
freqs.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From lwilton@earthlink.net  2005-08-06 02:06 -------
Subject: Re:  Score generation for SpamAssassin 3.1

> I'm not exactly sure which of Loren's scores Justin is referring to, but I
> think 3.5 is too low for BAYES_99.

I'm not sure which set either.  I think that 3.5 *might* be OK with net tests
also.  I think I'd want something closer to 4.0 - 4.5 or even higher without
net tests.  Wasn't it something just shy of 5 in 2.6?






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From cmt-spamassassin@someone.dhs.org  2005-07-28 11:06 -------
Here are my misclassifications (I guess whether or not it matters is still up for debate):

Virus Bounce: 
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1082327516.17711_3.ns1:2,S

Misclassified as spam (kinda sorta ham-ish-y I guess):
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1115735940.M20350P12544V0000000000000304I001D2C12_6.ns1,S=14073:2,S

Misclassified as spam (really ham)
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1106269507.18978_3.ns1:2,S




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-10 17:59 -------
Created an attachment (id=3070)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3070&action=view)
fix for t/rule_names.t

I think this helps




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From spamassassin@dostech.ca  2005-08-03 18:27 -------
Subject: Re:  Score generation for SpamAssassin 3.1

Same here.  I've been running with 3.0's scoreset 2 scores for both 
scoresets 2 and 3, for BAYES_50-99, with no problems (always using 
scoreset 3).

score BAYES_50 0 0 1.567 1.567
score BAYES_60 0 0 3.515 3.515
score BAYES_80 0 0 3.608 3.608
score BAYES_95 0 0 3.514 3.514
score BAYES_99 0 0 4.070 4.070






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From parkerm@pobox.com  2005-08-03 18:07 -------
The scores for the upper BAYES rules (i.e. 80, 95 and 99) are too low.  We should
lock in the values based on what we saw in the 3.0 release.

Personally I've been running with this in my local.cf for a long while with no
issues:

score BAYES_80 0 0 4.608 3.087
score BAYES_95 0 0 4.514 3.063
score BAYES_99 0 0 5.070 3.886


Granted the 80/95 set3 scores might be a tad high for general consumption.




[Bug 4505] [review] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Score generation for        |[review] Score generation
                   |SpamAssassin 3.1            |for SpamAssassin 3.1




------- Additional Comments From jm@jmason.org  2005-08-10 18:01 -------
ok. these patches all need votes, now: 3069, 3068, 3070.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-08-07 00:43 -------
I don't mind doing another validation and scoring run.  Commit a patch with
whatever you want to svn and let me know.  Make sure that the scores are in an
immutable block.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From lwilton@earthlink.net  2005-08-06 03:30 -------
Another suggested set of bayes values:

Bayes	Set 2	Set 3	Eqn 2	Eqn 3
0	-2.312	-2.599	-2.5	-2.6
5	-1.11	-0.413	-1.525	-2.2
20	-0.74	-1.951	-0.7	-2
40	-0.185	-1.096	0.4	-0.78
50	0.912	0.001	0.95	-0.1
60	2.22	0.372	1.8	0.58
80	2.775	2.087	2.7	1.94
95	3.237	2.063	3.425	2.96
99	3.145	1.886	3.645	3.232

The second and third columns are sets 2 and 3 from Henry's data.  The final two 
columns are my proposed values for sets 2 and 3.  These values are not what I 
would really like to see on the high end, but I think are about as high as one 
can somewhat reasonably go based on the data.

Both sets are essentially linear trendlines for sets 2 and 3, with some hand 
corrections to better match what I consider a few important data points.
In particular, bayes_00 for both sets 2 and 3 are close to -2.5.  However the 
trendlines would predict values around -1.7 for set 2 and -3.2 or so for set 
3.  I've moved the bayes_00 point to something that the data will support in 
both cases.  Also both sets show a weakness in bayes_05.  I've pushed the 
bayes_05 trendline values upward for both sets, although not far enough to 
create score inversions.

It should be noted that both original sets indicate a flattening of the bayes 
scores over 80%.  I've left these values as the linear trendline would predict, 
since that seems to be closer to normal human experience.  It must be noted 
though that the data doesn't really support these extrapolations, especially 
for bayes_99.  

Neither bayes_99 score comes close to 4.0.  I tried to play with the data until 
I could get something in that range, but it wouldn't go along with the game.  
It would be possible to tweak the set 2 scores for 95 and 99 upward to aim at 
4.0 without departing too badly from the data.  This wouldn't be possible with 
the set 3 scores.
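
A plain least-squares trendline for either set can be computed as below. This is a sketch only: it fits an ordinary straight line through all nine points of Henry's set 3 column, so it will not reproduce the "Eqn" columns above, which include hand corrections.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


groups = [0, 5, 20, 40, 50, 60, 80, 95, 99]
set3 = [-2.599, -0.413, -1.951, -1.096, 0.001, 0.372, 2.087, 2.063, 1.886]

slope, intercept = fit_line(groups, set3)
for g, actual in zip(groups, set3):
    print(f"{g:>2}\t{actual:+.3f}\t{slope * g + intercept:+.3f}")
```

Comparing the last two printed columns shows where the generated scores depart from a straight line, for example at bayes_05 and above bayes_80.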





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3062 is|0                           |1
           obsolete|                            |
         AssignedTo|dev@spamassassin.apache.org |jm@jmason.org




------- Additional Comments From jm@jmason.org  2005-08-09 17:01 -------
Created an attachment (id=3065)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3065&action=view)
redo of 3062

ok, this one's better, includes the freqs!  Please vote.....




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


henry@stern.ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
   Target Milestone|Undefined                   |3.1.0







[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3048 is|0                           |1
           obsolete|                            |
Attachment #3051 is|0                           |1
           obsolete|                            |




------- Additional Comments From jm@jmason.org  2005-08-08 19:33 -------
Created an attachment (id=3062)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3062&action=view)
release-quality patch

hey, here's a patch that uses the scores from attachment 3046, plus the bayes
scores from attachment 3051, and includes STATISTICS files for all scoresets.

This is release-quality, if we want to go with this; alternatively, we can wait
for a go-around with the locked-down Bayes scores.

IMO: we should release with these.  set 3 is looking fine as-is, and we're
spending a lot of time on this.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-07 00:30 -------
yeah, I'd like to do another perceptron run with those immutable -- however it
might take too long.  that's up to Henry, really.... in the meantime let's apply
3051.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-28 10:12 -------
well, we disagree ;)   I'd appreciate some comments from the rest of the
committers on how they feel about this one.   Here's a chat log between myself
and H talking about it....


(09:49:33) henry: so about fixing up logs
(09:50:19) henry: I'd rather that we didn't because:
1) You've only removed errors from 10% of the logs.
2) You haven't removed the errors that both you and SA has made.
(09:50:25) henry: have made
(09:51:00) jm: please respond via mail on this one, I suspect I'm not the only
one who disagrees ;)
(09:51:18) henry: sure
(09:51:56) jm: imo we need to try and get the logs as clean as poss, even if
we're missing 90% of the FPs/FNs
(09:52:19) henry: we're just gaming the numbers
(09:52:32) jm: even if the perceptron is able to deal with some noise, the logs
are used for other things (STATISTICS.txt) that cannot deal with noise
(09:52:36) henry: the learning algorithm would be useless if it couldn't work
around a few mistakes
(09:52:58) jm: we're not gaming it -- we're using it to build something nearer a
"gold standard" in Cormack terms
(09:53:13) henry: and what I'm saying is that by correcting errors in only one
direction, STATISTICS.txt will be worse off than it was before
(09:53:24) henry: Cormack uses multiple classifiers to make his "gold standard"
(09:56:27) jm: why are we correcting errors only in 1 dir?
(09:56:31) jm: don't get that
(09:56:54) henry: you're not correcting entries where both you and SA have erred
(09:57:22) henry: so they look like TPs and TNs, but in fact they are FNs and FPs
(09:57:52) jm: ok.   but it's still *better* than the current logs
(09:58:03) henry: I disagree
(09:58:03) jm: in that there are *less* FPs and FNs overall
(09:58:17) jm: even if there are still *some* FPs and FNs
(09:58:19) henry: there are indeed less FPs and FNs overall
(09:58:44) henry: but since we know how many errors we've seen, we can make some
predictions about what's gone on in the other direction
(09:59:49) jm: I disagree that that's useful ;)
(09:59:58) jm: unless you want to fix the STATISTICS generating scripts as well...
(10:01:30) henry: well, here's the thing
(10:01:37) henry: from first look
(10:01:47) henry: it seems that people have about the same amount misclassified
in each direction
(10:01:49) henry: that have been found
(10:02:42) henry: so you could hypothesise that there are plenty that have gone
the other way
(10:03:29) henry: and that they are about the same proportion
(10:03:34) henry: maybe
(10:03:36) henry: I don't know
(10:04:07) henry: all that I can say is that by fixing based solely on the
suspected mistakes of the classifier, we're biasing the results to make things
look better than they are
(10:04:45) henry: and really.. at the end of the day, the numbers reflect how
good the sample set is
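
Henry's point about one-directional correction can be illustrated with made-up numbers. The sketch below assumes that some fraction of real spam was filed in the ham corpus by mistake, that every message where SA disagrees with its label gets reviewed and relabelled correctly, and that messages where SA agrees with a wrong label are never looked at. All names and rates are hypothetical; none of them come from the actual logs.

```python
def fn_rate_after_cleanup(n_spam, fn_rate, label_error_rate):
    """Return (reported FN rate after cleanup, hidden FN count) for real spam."""
    mislabelled = n_spam * label_error_rate      # real spam filed as ham
    well_labelled = n_spam - mislabelled
    visible_fn = well_labelled * fn_rate         # labelled spam, scored as ham
    relabelled = mislabelled * (1 - fn_rate)     # labelled ham, scored as spam: reviewed, fixed
    hidden_fn = mislabelled * fn_rate            # labelled ham, scored as ham: looks like a TN
    spam_log_size = well_labelled + relabelled
    return visible_fn / spam_log_size, hidden_fn


# Hypothetical inputs: 100000 real spam, 3% true FN rate, 1% filed in the wrong corpus.
reported, hidden = fn_rate_after_cleanup(100000, 0.03, 0.01)
print(f"true FN rate 0.0300, reported after cleanup {reported:.4f}, hidden FNs {hidden:.0f}")
```

Under these assumptions the reported rate comes out slightly below the true one, because the relabelled messages are all ones SA got right while the hidden ones are all ones it got wrong. How large the effect is depends entirely on the real mislabelling rate, which this thread does not establish.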





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-28 09:46 -------
well, in terms of generating STATISTICS.txt at least, I would prefer to have the
bad entries fixed; those numbers are published.  it's pretty trivial to fix up
the logs appropriately using "remove-ids-from-mclog", I'll do it if you want.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From sidney@sidney.com  2005-07-29 16:29 -------
Here's an email Bob sent to sa-dev mailing list that looks like it was meant to
be a comment here. Or if not, I think it should be in the record here and it is
on a public list so I feel free to repost it. However, 259 is a lot less than
792 so there still is a question why Bob has so many Bonded sender FPs.

  ---- rest of this is a quote -----

Hello Henry,

Wednesday, July 27, 2005, 6:39:22 PM, you wrote:


>> jm@jmason.org changed:
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>            Severity|normal                      |critical
>>            Priority|P5                          |P1


>> since quite a few of the mass-checkers don't have accounts on that
>> box, I've also copied the set3 files to these URLs: 
>> http://taint.org/xfer/2005/set3.fn.gz
>> http://taint.org/xfer/2005/set3.fp.gz


>> Please download and verify that any mails in the FP set that are
>> coming from your corpus, are indeed valid ham; and ditto for the FN
>> set being spam.


FN:

I spot-checked all FNs with positive scores, and checked every FN with
negative scores.  Corpus is clean, except:

ham: mid=<ma...@ctyme.com>

discount: Message-ID: <12...@agent1.ientrymail.com>
          Message-ID: <28...@mailagent0.ientrymail.com>
spam newsletter, but this user probably subscribed to it...

There are 259 emails from/via constantcontact.com which are treated
as spam on my system, have been flagged as spam on my system (scores
as high as 30's and 40's), have been encapsulated on delivery, have
never been flagged by any user as not-spam, but, for the purposes of a
world-wide mass-check, these constantcontact.com emails might be
questionable.

Note: Not all constantcontact.com is treated as spam here -- quite a
few cc.com newsletters are subscribed to and seen as ham by their
subscribers and the system. The ones I find above in the fns file are
all from a set of eight newsletters which have regularly (almost
always) been seen as spam, and no user has ever corrected that
classification.

Henry: To remove these from the log (if you want to), remove
everything where the path is
/home/Bob/spamassassin.active/masses/corpus.spam (or corpus.ham),
since that identifies my corpus contribution, and where the mid ends
in @scheduler. 

FP:  Checked every one.  Corpus is clean, except:

ham: Message-ID: <11...@yahoogroups.com>
There are two of these listed. One should be removed.

spam: mid=<17...@hotmail.com>


Bob Menschel








[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-28 10:16 -------
so in summary:

- I think we should try to make the logs as clean as possible

- Henry thinks we should keep the logs as they are, and use that to estimate a
misclassification figure instead

(PS: henry also notes that Bayes will have been trained on those instances, too.)




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From sidney@sidney.com  2005-07-29 16:43 -------
Of course I should have said FN not FP in the last comment. And in case it is
not clear to someone reading this: constantcontact.com runs the Bonded Sender
service, which is what the RCVD_IN_BSP_TRUSTED rule looks for.

Bob, what does it mean that you say that you have 259 emails from/via
constantcontact.com that are flagged as spam, but Justin says that the log shows
792 BSP_TRUSTED hits from your spam corpus?





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From bas@debian.org  2005-07-28 03:40 -------
My check of the set3 results gives:

FNS:  (can be moved to spam if you want, or deleted)
/scratch/SA/mails/2005-01.mbox.ham.21322338

FPS:   (can be moved to ham or deleted)
/scratch/SA/mails/personal.2005w08.spam.194510
/scratch/SA/mails/personal.2005w09.spam.780822
/scratch/SA/mails/personal.2005w20.spam.220636
/scratch/SA/mails/personal.2005w21.spam.1340310
/scratch/SA/mails/personal.2005w22.spam.1210785
/scratch/SA/mails/personal.2005w25.spam.886714

INVALID, DELETE FROM SPAM:   (bounces, viruses, etc.)
/scratch/SA/mails/backup.2005.jan-may.spam.101747
/scratch/SA/mails/traps.2005w09.spam.1189670
/scratch/SA/mails/personal.2005w09.spam.704311
/scratch/SA/mails/personal.2005w14.spam.1332942
/scratch/SA/mails/personal.2005w28.spam.161488




Re: [Bug 4505] Score generation for SpamAssassin 3.1

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Henry,

Wednesday, July 27, 2005, 6:39:22 PM, you wrote:

> jm@jmason.org changed:
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Severity|normal                      |critical
>            Priority|P5                          |P1

> since quite a few of the mass-checkers don't have accounts on that
> box, I've also copied the set3 files to these URLs: 
> http://taint.org/xfer/2005/set3.fn.gz
> http://taint.org/xfer/2005/set3.fp.gz

> Please download and verify that any mails in the FP set that are
> coming from your corpus, are indeed valid ham; and ditto for the FN
> set being spam.

FN:

I spot-checked all FNs with positive scores, and checked every FN with
negative scores.  Corpus is clean, except:

ham: mid=<ma...@ctyme.com>

discount: Message-ID: <12...@agent1.ientrymail.com>
          Message-ID: <28...@mailagent0.ientrymail.com>
spam newsletter, but this user probably subscribed to it...

There are 259 emails from/via constantcontact.com which are treated
as spam on my system, have been flagged as spam on my system (scores
as high as 30's and 40's), have been encapsulated on delivery, have
never been flagged by any user as not-spam, but, for the purposes of a
world-wide mass-check, these constantcontact.com emails might be
questionable.

Note: Not all constantcontact.com is treated as spam here -- quite a
few cc.com newsletters are subscribed to and seen as ham by their
subscribers and the system. The ones I find above in the fns file are
all from a set of eight newsletters which have regularly (almost
always) been seen as spam, and no user has ever corrected that
classification.

Henry: To remove these from the log (if you want to), remove
everything where the path is
/home/Bob/spamassassin.active/masses/corpus.spam (or corpus.ham),
since that identifies my corpus contribution, and where the mid ends
in @scheduler. 
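A minimal sketch of the removal described above, in Python. It assumes (this is not confirmed anywhere in the thread) that each log entry is a single line of text containing both the message path and a "mid=<...>" field, and that "ends in @scheduler" means the part after the @ is exactly "scheduler". The names should_drop and filter_log are hypothetical.

```python
import re

# Path prefixes that identify Bob's corpus contribution (from the email above).
BOB_PREFIXES = (
    "/home/Bob/spamassassin.active/masses/corpus.spam",
    "/home/Bob/spamassassin.active/masses/corpus.ham",
)

# Assumed shape of the message-id field: mid=<something@scheduler>
SCHEDULER_MID = re.compile(r"mid=<[^>]*@scheduler>")


def should_drop(line):
    """True when the line is from Bob's corpus and its mid ends in @scheduler."""
    from_bob = any(prefix in line for prefix in BOB_PREFIXES)
    return from_bob and SCHEDULER_MID.search(line) is not None


def filter_log(lines):
    """Return only the log lines that survive the removal."""
    return [line for line in lines if not should_drop(line)]
```

Both conditions have to hold before a line is dropped, so constantcontact.com mail in other people's corpora and the rest of Bob's corpus are left alone.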

FP:  Checked every one.  Corpus is clean, except:

ham: Message-ID: <11...@yahoogroups.com>
There are two of these listed. One should be removed.

spam: mid=<17...@hotmail.com>


Bob Menschel



[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |critical
           Priority|P5                          |P1




------- Additional Comments From jm@jmason.org  2005-07-27 18:39 -------
(bumping pri to the appropriate level)

since quite a few of the mass-checkers don't have accounts on that box, I've
also copied the set3 files to these URLs:

http://taint.org/xfer/2005/set3.fn.gz
http://taint.org/xfer/2005/set3.fp.gz

Please download and verify that any mails in the FP set that are coming from
your corpus, are indeed valid ham; and ditto for the FN set being spam.

Btw Henry -- in my case, the breakdown of errors is as follows...

FNS:  (can be moved to spam if you want, or deleted)
/home/jm/Mail/deld.priv/56232
/home/jm/Mail/deld.priv/61238
/home/jm/Mail/sent/587
/home/jm/Mail/sent/736

INVALID, DELETE FROM HAM:   (rule discussion, bounced spam)
/home/jm/Mail/deld.priv/111034
/home/jm/Mail/A3inbox/1

FPS:   (can be moved to ham or deleted)
/home/jm/cor/spam.cor/20041029a/216
/home/jm/cor/spam.cor/20041029a/226
/home/jm/cor/spam.cor/20041029a/246
/home/jm/cor/spam.cor/20041029a/233
/home/jm/cor/spam.cor/20041029a/235

INVALID, DELETE FROM SPAM:   (bounced spam)
/home/jm/Mail/Sapm/1540
/home/jm/Mail/Sapm/1647





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-07 18:13 -------
ok, I got that chance; 3051 is now applied.

trunk:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230721.

b3_1_0:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230723.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From duncf@debian.org  2005-08-06 22:58 -------
What I meant to say was that we should set the BAYES scores explicitly and make
them immutable, then re-run the perceptron. In that case, I'd rather see
slightly higher bayes scores, closer to those in comment 40 or comment 41
(probably in between). I'd like to see about 4.5 for BAYES_99.





[Bug 4505] [review] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From jm@jmason.org  2005-08-11 17:06 -------
ok! applied, 231543 and 231544.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3044 is|0                           |1
           obsolete|                            |




------- Additional Comments From jm@jmason.org  2005-08-02 13:45 -------
Created an attachment (id=3048)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3048&action=view)
freqs for scoreset 3, all logs, all rules

Daniel noticed that the freqs file I posted was missing SPF_PASS (for some
reason, it's listed as a userconf rule, dunno why).  Here's a copy that
includes it.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From sidney@sidney.com  2005-07-29 17:04 -------
Oh, I got confused by this:

http://www.constantcontact.com/services/bonded-sender-program.jsp

I guess constantcontact provides a way for people to get Bonded Sender status
for $25/month and no risk of losing a bond. I wonder what the business model is
if spammers take advantage of it. That would explain it if they are a source of
most of the BSP_TRUSTED FNs. Except there are still another 533 FNs to explain.




[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-07-29 16:03 -------
further info regarding the BSP_TRUSTED hits --

grep BSP_TRUSTED spam.log > o
perl -ne '/ (\/[^\/]+\/[^\/]+\/[^\/]+)/ and print "$1\n"' o | uniq -c
 792 /home/Bob/spamassassin.active
  10 /home/duncf/Maildir
   2 /home/jm/Mail
   1 /home/jm/cor
   4 /home/corpus/mail
   1 /home/corpus/SA

97% of the Bonded Sender hits on spam are from Bob's corpus.   I suspect
something's up with the corpus there... spamtraps?  retired accounts?
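For anyone who wants to repeat the per-corpus count without relying on the log being grouped by contributor (uniq -c only merges adjacent lines), here is a rough Python equivalent. It reuses the same path pattern as the perl one-liner; the function names count_by_corpus and share are made up for this sketch, and the only assumption about the log is that each line contains the rule name and a space followed by an absolute path.

```python
import re
from collections import Counter

# Same pattern as the perl one-liner: a space, then the first three
# components of an absolute path.
PREFIX = re.compile(r" (/[^/]+/[^/]+/[^/]+)")


def count_by_corpus(lines, rule="BSP_TRUSTED"):
    """Count log lines mentioning `rule`, grouped by three-component path prefix."""
    counts = Counter()
    for line in lines:
        if rule not in line:
            continue
        match = PREFIX.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def share(counts, prefix):
    """Fraction of all counted hits that fall under `prefix`."""
    total = sum(counts.values())
    return counts[prefix] / total if total else 0.0
```

With the counts listed above, 792 of 810 hits come from /home/Bob/spamassassin.active, which is where the 97% figure comes from.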


PS: there's an argument that having FPs in the logs is irrelevant.
however, I disagree -- the Perceptron is only *one* thing that uses
the logs.  There are also the following:

  - overall FP/FN% figures for scoresets and thresholds (STATISTICS.txt)
  - rule freqs, for per-rule FP/FN% figures

Given those two, there are good reasons to clean up the logs.





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From henry@stern.ca  2005-07-29 17:29 -------
Created an attachment (id=3045)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3045&action=view)
Proposed scores for 3.1

gen-set0-2.0-4.0-100
# SUMMARY for threshold 5.0:
# Correctly non-spam:  74239  99.92%
# Correctly spam:     113219  76.56%
# False positives:        60  0.08%
# False negatives:     34655  23.44%
# TCR(l=50): 3.927075  SpamRecall: 76.565%  SpamPrec: 99.947%

gen-set1-2.0-4.0-100
# Correctly non-spam:  74274  99.92%
# Correctly spam:     138015  93.05%
# False positives:        59  0.08%
# False negatives:     10312  6.95%
# TCR(l=50): 11.184361  SpamRecall: 93.048%  SpamPrec: 99.957%

gen-set2-2.0-4.625-100
# Correctly non-spam:  74747  99.92%
# Correctly spam:     134723  90.61%
# False positives:        58  0.08%
# False negatives:     13955  9.39%
# TCR(l=50): 8.821003  SpamRecall: 90.614%  SpamPrec: 99.957%

gen-set3-2.0-5.0-100
# Correctly non-spam:  74528  99.92%
# Correctly spam:     143427  96.65%
# False positives:        59  0.08%
# False negatives:      4975  3.35%
# TCR(l=50): 18.725804  SpamRecall: 96.648%  SpamPrec: 99.959%
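
The TCR, SpamRecall and SpamPrec lines above are consistent with the usual total cost ratio definition, TCR = nspam / (lambda * FP + FN), with lambda = 50. The Python sketch below recomputes them from the four raw counts. It is a reading of the posted numbers, not a copy of whatever script produced them, and the function name summarize and its parameter names are made up here.

```python
def summarize(ham_ok, spam_ok, false_pos, false_neg, lam=50):
    """Recompute the summary figures from the four raw counts.

    lam is the weight given to a false positive relative to a false
    negative (the "l=50" in the output above).
    """
    n_spam = spam_ok + false_neg
    return {
        # Total cost ratio: spam count over weighted error count.
        "tcr": n_spam / (lam * false_pos + false_neg),
        # Percentage of all spam that was caught.
        "recall_pct": 100.0 * spam_ok / n_spam,
        # Percentage of messages flagged as spam that really were spam.
        "precision_pct": 100.0 * spam_ok / (spam_ok + false_pos),
        # Percentage of all ham that was wrongly flagged.
        "fp_pct": 100.0 * false_pos / (ham_ok + false_pos),
    }


# Counts taken from the gen-set0 and gen-set3 blocks above.
set0 = summarize(74239, 113219, 60, 34655)
set3 = summarize(74528, 143427, 59, 4975)
```

Because lambda multiplies only the false positives, each FP costs as much as 50 FNs in this measure, which is why set3's drop in false negatives moves the TCR so much while the FP count stays flat.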





[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From lwilton@earthlink.net  2005-07-30 21:48 -------
Subject: Re:  Score generation for SpamAssassin 3.1

> BTW weren't we planning to set the BAYES_ scores non-mutable?
> can't quite recall.

I know there had been talk of it, although I'm too lazy to try to dig up the
thread.

I think, if it isn't too much work, what I'd like to see would be something
like taking the final generated scoreset, normalizing the bayes numbers for
all sets to ascending sequence more or less*, and then locking them and
rerunning the score generation to get updated values for the other rules.

*    From the data I looked at in Henry's posting, I seem to recall that 05
and 99 were obviously out of sequence.  I think 99 is the critical one to
have in sequence.  05 may be correct where it is, even though out of
sequence.  Perhaps a topic for discussion.

        Loren
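
One possible reading of "normalizing to ascending sequence", sketched in Python: walk the BAYES_ buckets in order and raise any score that dips below the one before it (a running maximum). This is only one of several ways to do it and is not something the thread settles on. The function name lock_ascending is hypothetical, and the scores in the example are invented for illustration, not taken from any attachment.

```python
def lock_ascending(scores):
    """Raise any score that dips below its predecessor (running maximum).

    `scores` is a list of (rule_name, score) pairs, ordered from the
    lowest Bayes bucket to the highest.
    """
    fixed = []
    highest = float("-inf")
    for name, value in scores:
        highest = max(highest, value)
        fixed.append((name, highest))
    return fixed


# Invented values: BAYES_99 comes out of the generator below BAYES_95.
example = [("BAYES_80", 2.0), ("BAYES_95", 3.0), ("BAYES_99", 1.9)]
locked = lock_ascending(example)
```

Here BAYES_99 is lifted to match BAYES_95 and the other two are untouched. A running maximum never lowers a score, so a bucket like 05 that sits too high would pull its successors up rather than come down itself. That case would need a different rule or a manual choice.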






[Bug 4505] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From Bob@Menschel.net  2005-08-06 22:53 -------
+1 on 3051, and I agree it'd be good to see whether a perceptron run would back
out those two extra FPs (though I'm not overly concerned about just two FPs). 




[Bug 4505] [review] Score generation for SpamAssassin 3.1

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From jm@jmason.org  2005-08-10 20:07 -------
'Justin, can you elaborate on why rule_names.t was failing? I don't see why
FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not.'

FUZZY_VALIUM contained "VALIUM" which was firing on DRUGS_ANXIETY
(__DRUGS_ANXIETY_3 to be exact).   I couldn't see exactly why, but it certainly
was firing on that bit of the name ;)

I have no idea why VIOXX/VICODIN aren't firing, although the __DRUGS_FOO_N rules
all seem to have individual subrules for each drug, and some have \b and some
have other start-of-string markers.  rule_names.t is a bit of a combinatorial
lucky dip I think. :(


