Posted to users@spamassassin.apache.org by Bob Proulx <bo...@proulx.com> on 2008/09/20 18:12:33 UTC

Score Hit Frequency in SA Corpus?

Are the hit frequencies from the SpamAssassin corpus available on the
web somewhere?  I looked through the docs and wiki but didn't see it
if they were.

What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
I wanted to know so that I could educate a sender that using all caps
in a long subject makes it look significantly like spam but couldn't
deduce the statistical numbers.

Thanks
Bob

Re: Score Hit Frequency in SA Corpus?

Posted by Bob Proulx <bo...@proulx.com>.
Justin Mason wrote:
> Joseph Brennan writes:
> > >> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
> > >>   1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS
> > No, it's high.  Only 1.87% had all caps subject, but of those 85%
> > were spam: 1.60 / 1.87.
> > If I am reading correctly.
> 
> That's right.  

Ah...  That makes more sense to me now.  Thanks for the
clarification.

> The problem with SUBJ_ALL_CAPS is that it tends to catch really odd
> fraud spams, foreign-language spam etc. that the other rules fail to
> spot; this means that the GA likes it quite a lot, since despite 
> the occasional FP, it reduces FNs enough to make it "worth it".

Sure.  All is good here.

> it's hard to avoid this issue. :(

Let me stress that I wasn't unhappy with this rule.  It isn't scored
high enough by itself to create an FP anyway.  It was just one of
several things that hit.  But it is something that people can affect
by how they write their messages.  So it is visible because it is
such a simple thing that a sender can change to affect the result.

Thanks for the explanations of the hit ratios!

Bob

Re: Score Hit Frequency in SA Corpus?

Posted by Joseph Brennan <br...@columbia.edu>.

--On Sunday, September 21, 2008 18:39 -0600 Bob Proulx <bo...@proulx.com> 
wrote:

>> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>   1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS
>
> Am I reading that correctly to see that in spam all caps showed up in
> 1.60% of the regression corpus and only in 0.27% of the non-spam?
> Gosh that seems like a very small indicator.


No, it's high.  Only 1.87% had all caps subject, but of those 85%
were spam: 1.60 / 1.87.

If I am reading correctly.
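
As a quick check of that arithmetic, here is a minimal Python sketch that recomputes S/O from the SPAM% and HAM% columns of the quoted row. It assumes S/O is simply SPAM% / (SPAM% + HAM%), which is consistent with the 0.855 shown; the function name is only illustrative. Note that 1.87 is the sum of the two per-corpus percentages, while the row's OVERALL column reads 1.116.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that fall on spam, given per-corpus hit rates.

    Assumption: S/O = SPAM% / (SPAM% + HAM%), i.e. the spam and ham
    corpora are weighted equally regardless of their sizes.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit either corpus")
    return spam_pct / total


# Values taken from the SUBJ_ALL_CAPS row quoted in this thread.
so = spam_over_overall(1.5957, 0.2705)
print(round(so, 3))
```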

Joseph Brennan
Columbia University Information Technology



Re: Score Hit Frequency in SA Corpus?

Posted by Bob Proulx <bo...@proulx.com>.
Thanks Daryl and Matt,

Daryl C. W. O'Shea wrote:
> On the web, http://ruleqa.spamassassin.org/

Thanks!

> > What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
>
> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS

Am I reading that correctly to see that in spam all caps showed up in
1.60% of the regression corpus and only in 0.27% of the non-spam?
Gosh that seems like a very small indicator.

Matt Kettler wrote:
> You can also grab them from the web image of SVN:
> http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

Cool stuff.

> However, bear in mind, scores are not assigned based on the S/O of the
> rule alone. The whole ruleset is scored collectively as one giant group,
> and tuned to get the best results.

Through the genetic algorithm, yes.  And I know the score on this rule
is just a component.  But in this announcement message (congratulating
someone on an achievement) they were using all caps like crazy along
with other things and the sum total of things made it difficult to
distinguish the message from a typical spam message.

I didn't think the SA rule was undesirable in this case.  On a quick
glance my eye had trouble telling the message apart from spam, and so
I wanted to educate the sender to improve the announcement messages in
the future.  But then I wondered how many spam messages are actually
sent in all caps.  That used to be more common in the old days but
not so much these days, as far as I know.  I figured the spam corpus
would provide the data, but I couldn't figure out how to find it and
so decided to ask.

> A rule with a high-ish score and a not-so-great S/O suggests that the
> rule's false positives commonly coincide with strongly negative-scoring
> rules. Based on that, the score assignment system will give it an
> "unfairly high" score, because doing so results in fewer FPs than
> assigning a higher score to some other rule that has a better S/O but
> whose nonspam hits are not compensated by negative-scoring rules.
> 
> The whole thing gets quite complicated, but when you start to realize
> that every rule's score is a function not only of its own hit-rate but
> also of what other rules it gets grouped with, you start to get a feel
> for what's going on. Of course, strictly evaluating all combinations of
> all rules would be very computationally expensive, which is why we use
> a perceptron, which generates an estimate. (I believe it's a
> successive-approximation type deal, but I'm not deeply familiar with
> its internal workings.)

All good background information.  Thanks for educating me.

Again, just to be pedantic, I didn't have a complaint about
SUBJ_ALL_CAPS.  I think it is okay.  But the above does explain the
score with DRUGS_STOCK_MIMEOLE perhaps.  That was my other message and
I do think it scores those messages too aggressively since it looks
like it hits with a normal version of MS Outlook.  But that is already
logged in the tracker.

Thanks!
Bob

Re: Score Hit Frequency in SA Corpus?

Posted by Matt Kettler <mk...@verizon.net>.
Bob Proulx wrote:
> Are the hit frequencies from the SpamAssassin corpus available on the
> web somewhere?  I looked through the docs and wiki but didn't see it
> if they were.
>
> What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?
> I wanted to know so that I could educate a sender that using all caps
> in a long subject makes it look significantly like spam but couldn't
> deduce the statistical numbers.
>   
It's included in the distribution tarball. In the rules subdirectory
check out STATISTICS-setX.txt, where X is the scoreset you're interested
in the stats for.

You can also grab them from the web image of SVN:

http://svn.apache.org/repos/asf/spamassassin/branches/3.2/rules/

And for what it's worth, the S/O is 0.855 in set 3.

However, bear in mind, scores are not assigned based on the S/O of the
rule alone. The whole ruleset is scored collectively as one giant group,
and tuned to get the best results.

A rule with a high-ish score and a not-so-great S/O suggests that the
rule's false positives commonly coincide with strongly negative-scoring
rules. Based on that, the score assignment system will give it an
"unfairly high" score, because doing so results in fewer FPs than
assigning a higher score to some other rule that has a better S/O but
whose nonspam hits are not compensated by negative-scoring rules.

The whole thing gets quite complicated, but when you start to realize
that every rule's score is a function not only of its own hit-rate but
also of what other rules it gets grouped with, you start to get a feel
for what's going on. Of course, strictly evaluating all combinations of
all rules would be very computationally expensive, which is why we use
a perceptron, which generates an estimate. (I believe it's a
successive-approximation type deal, but I'm not deeply familiar with
its internal workings.)
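
To make the point about joint scoring concrete, here is a toy sketch in Python. It is a textbook perceptron run on six made-up messages, not SpamAssassin's actual score generator; the rule names OTHER_SPAM_RULE and WHITELIST_LIKE, the data, and the function names are all invented for illustration.

```python
# Toy illustration only: hypothetical rules and data, not SpamAssassin code.
RULES = ["ALL_CAPS", "OTHER_SPAM_RULE", "WHITELIST_LIKE"]

# Each message is (set of rules that hit, is_spam).
MESSAGES = [
    ({"ALL_CAPS"}, True),
    ({"ALL_CAPS", "OTHER_SPAM_RULE"}, True),
    ({"OTHER_SPAM_RULE"}, True),
    ({"ALL_CAPS", "WHITELIST_LIKE"}, False),  # ham hit offset by a negative rule
    ({"WHITELIST_LIKE"}, False),
    (set(), False),
]


def classify(hits, weights, bias):
    """Return True (spam) when the summed rule weights plus bias exceed zero."""
    return bias + sum(weights[r] for r in hits) > 0


def train(messages, rules, epochs=100, lr=1.0):
    """Classic perceptron update: adjust weights only on misclassified messages."""
    weights = {r: 0.0 for r in rules}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for hits, is_spam in messages:
            if classify(hits, weights, bias) != is_spam:
                mistakes += 1
                step = lr if is_spam else -lr
                bias += step
                for r in hits:
                    weights[r] += step
        if mistakes == 0:  # every training message is classified correctly
            break
    return weights, bias


weights, bias = train(MESSAGES, RULES)
print(weights, bias)
```

In this made-up data ALL_CAPS hits one ham message and OTHER_SPAM_RULE hits none, yet ALL_CAPS does not end up with a lower weight, because its one ham hit always coincides with the negative-scoring rule. Every weight is set by what it takes to classify the whole set of messages, not by that rule's own S/O.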

Re: Score Hit Frequency in SA Corpus?

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 20/09/2008 12:12 PM, Bob Proulx wrote:
> Are the hit frequencies from the SpamAssassin corpus available on the
> web somewhere?  I looked through the docs and wiki but didn't see it
> if they were.

On the web, http://ruleqa.spamassassin.org/

In the tarball, rules/STATISTICS*

> What is the hit frequency in the corpus of SUBJ_ALL_CAPS scoring 2.1?

OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS
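
If you want to work with a row like this in a script, the following Python sketch splits the header and the row on whitespace and pairs them up. It only assumes the layout of the two lines shown above; whether every line in the STATISTICS files follows that layout is not something this sketch checks, and `parse_row` is just an illustrative name.

```python
HEADER = "OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME"
ROW = "  1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS"


def parse_row(header: str, row: str) -> dict:
    """Pair whitespace-separated column names with the values of one row."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("row does not match header layout")
    record = {}
    for name, value in zip(names, values):
        # NAME is the rule name; every other column is numeric.
        record[name] = value if name == "NAME" else float(value)
    return record


stats = parse_row(HEADER, ROW)
print(stats["NAME"], stats["SPAM%"], stats["HAM%"], stats["S/O"])
```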


Daryl