Posted to users@spamassassin.apache.org by micah anderson <mi...@riseup.net> on 2018/08/14 15:38:27 UTC

Understanding ruleQA results

Hi,

I'm trying to understand the ruleQA results because I'm trying to track
down how reliably the rule FRNAME_IN_MSG_NO_SUBJ indicates spam.

I load the latest rules: http://ruleqa.spamassassin.org/20180813-r1837926-n/FRNAME_IN_MSG_NO_SUBJ/detail?s_corpus=1&s_g_over_time=1#overtime

and I see the S/O value is 1.0, which is a rule that hits only on spam
(a rule that only hits on ham is 0.0, a rule that doesn't tell you anything is
0.5)... but how can I tell how many messages are part of the corpus?

Also, the percentages seem very low: 1.5192% Spam, and .0005%
Ham... 1.5% seems low to me to be adding 3.5 score to this rule, but
what do I know... which is why I'm asking.

thanks!


-- 
        micah

Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Aug 2018, micah anderson wrote:

> John Hardin <jh...@impsec.org> writes:
>
>> On Tue, 14 Aug 2018, RW wrote:
>>
>>> On Tue, 14 Aug 2018 13:24:47 -0700 (PDT)
>>> John Hardin wrote:
>>>
>>>> On Tue, 14 Aug 2018, micah anderson wrote:
>>>>
>>>
>>>>> I searched my pile of mail that I have from two ice ages ago, and I
>>>>> did find 6 messages that were hits of this rule, one of them was
>>>>> spam, five of them were this person trying to contact me.
>>>>
>>>> ...without a subject?
>>>>
>>>>>> Do you happen to be seeing FPs with this rule?
>>>>>
>>>>> Yes, it's why I am investigating it. I think it is common for people
>>>>> who are sending mail from their mobiles, where they use it more
>>>>> like a quick chat instead of a 'regular mail'....
>>>>>
>>>>> In fact, this person used:
>>>>> X-Mailer: iPad Mail (15F79)
>>>>
>>>> OK, I can see about adding some mobile MUA exclusions. Any FP headers
>>>> you can provide (directly) will be helpful. Go ahead and sanitize the
>>>> recipient info, I don't think that would be relevant to tuning this
>>>> one.
>
> I'll provide some pastebin links in a separate email.
>
>>> I don't know that this is particularly specific to mobile, lots of
>>> people send emails with an empty subject.
>>>
>>> It sounds like the main cause would be a signature that contains the
>>> sender's name as the only thing in a line. That'll be why all the
>>> FPs mentioned above came from the same person.
>
> Yes, this person has as their signature their name on one line, and
> their From: has that same name listed.
>
>> Question: were those messages scored as spam?
>
> yes, they were, will include the reports in the off-list email.

Has the DKIM exclusion reduced or eliminated your false positives?


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Sheep have only two speeds: graze and stampede.     -- LTC Grossman
-----------------------------------------------------------------------
  7 days until the 1939th anniversary of the destruction of Pompeii

Re: Understanding ruleQA results

Posted by micah anderson <mi...@riseup.net>.
John Hardin <jh...@impsec.org> writes:

> On Tue, 14 Aug 2018, RW wrote:
>
>> On Tue, 14 Aug 2018 13:24:47 -0700 (PDT)
>> John Hardin wrote:
>>
>>> On Tue, 14 Aug 2018, micah anderson wrote:
>>>
>>
>>>> I searched my pile of mail that I have from two ice ages ago, and I
>>>> did find 6 messages that were hits of this rule, one of them was
>>>> spam, five of them were this person trying to contact me.
>>>
>>> ...without a subject?
>>>
>>>>> Do you happen to be seeing FPs with this rule?
>>>>
>>>> Yes, it's why I am investigating it. I think it is common for people
>>>> who are sending mail from their mobiles, where they use it more
>>>> like a quick chat instead of a 'regular mail'....
>>>>
>>>> In fact, this person used:
>>>> X-Mailer: iPad Mail (15F79)
>>>
>>> OK, I can see about adding some mobile MUA exclusions. Any FP headers
>>> you can provide (directly) will be helpful. Go ahead and sanitize the
>>> recipient info, I don't think that would be relevant to tuning this
>>> one.

I'll provide some pastebin links in a separate email.

>> I don't know that this is particularly specific to mobile, lots of
>> people send emails with an empty subject.
>>
>> It sounds like the main cause would be a signature that contains the
>> sender's name as the only thing in a line. That'll be why all the
>> FPs mentioned above came from the same person.

Yes, this person has as their signature their name on one line, and
their From: has that same name listed.

> Question: were those messages scored as spam?

yes, they were, will include the reports in the off-list email.

-- 
        micah

Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Wed, 15 Aug 2018, RW wrote:

> On Tue, 14 Aug 2018 18:43:52 -0700 (PDT)
> John Hardin wrote:
>
>> On Tue, 14 Aug 2018, RW wrote:
>
>>> I don't know that this is particularly specific to mobile, lots of
>>> people send emails with an empty subject.
>>>
>>> It sounds like the main cause would be a signature that contains the
>>> sender's name as the only thing in a line. That'll be why all the
>>> FPs mentioned above came from the same person.
>>
>> Question: were those messages scored as spam?
>
> MISSING_SUBJECT + BAYES_50 + FRNAME_IN_MSG_NO_SUBJ scores 6.098
>
>
> If I'm reading the score-map correctly (and 4 represents 4.000 to
> 4.999), then limiting the score to 2.0 seems like a reasonable
> compromise.
>
>
> scoremap spam:  1   0.17%    2
> scoremap spam:  3   1.99%   23
> scoremap spam:  4  88.86% 1029 ***********************************
> scoremap spam:  5   3.28%   38 *

OK, I'll lower the score limit on the FRNAME_IN_MSG rules a bit.



Re: Understanding ruleQA results

Posted by RW <rw...@googlemail.com>.
On Tue, 14 Aug 2018 18:43:52 -0700 (PDT)
John Hardin wrote:

> On Tue, 14 Aug 2018, RW wrote:

> > I don't know that this is particularly specific to mobile, lots of
> > people send emails with an empty subject.
> >
> > It sounds like the main cause would be a signature that contains the
> > sender's name as the only thing in a line. That'll be why all the
> > FPs mentioned above came from the same person.  
> 
> Question: were those messages scored as spam?

MISSING_SUBJECT + BAYES_50 + FRNAME_IN_MSG_NO_SUBJ scores 6.098


If I'm reading the score-map correctly (and 4 represents 4.000 to
4.999), then limiting the score to 2.0 seems like a reasonable
compromise.


scoremap spam:  1   0.17%    2 
scoremap spam:  3   1.99%   23 
scoremap spam:  4  88.86% 1029 ***********************************
scoremap spam:  5   3.28%   38 *
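
The arithmetic behind that suggestion can be sketched as follows. This is only an illustration: it assumes the rule currently scores exactly 3.5 (the thread only says about 3.5), that the 6.098 total includes the rule's score, and that SpamAssassin's default threshold of 5.0 is in use.

```python
REQUIRED_SCORE = 5.0   # SpamAssassin default threshold (assumed to be in use)
CURRENT_RULE = 3.5     # assumed; the thread only gives "~3.5"
PROPOSED_RULE = 2.0    # the limit suggested above

fp_total = 6.098       # MISSING_SUBJECT + BAYES_50 + FRNAME_IN_MSG_NO_SUBJ

# Swap the rule's current contribution for the proposed one.
new_total = fp_total - CURRENT_RULE + PROPOSED_RULE
print(f"new total: {new_total:.3f}, still spam: {new_total >= REQUIRED_SCORE}")
```

Under those assumptions the false positive would fall below the threshold. If the score-map totals include this rule's score, spam in the 4 bucket is already under 5.0 today, so lowering the limit would not change how those messages are classified.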





Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Aug 2018, RW wrote:

> On Tue, 14 Aug 2018 13:24:47 -0700 (PDT)
> John Hardin wrote:
>
>> On Tue, 14 Aug 2018, micah anderson wrote:
>>
>
>>> I searched my pile of mail that I have from two ice ages ago, and I
>>> did find 6 messages that were hits of this rule, one of them was
>>> spam, five of them were this person trying to contact me.
>>
>> ...without a subject?
>>
>>>> Do you happen to be seeing FPs with this rule?
>>>
>>> Yes, it's why I am investigating it. I think it is common for people
>>> who are sending mail from their mobiles, where they use it more
>>> like a quick chat instead of a 'regular mail'....
>>>
>>> In fact, this person used:
>>> X-Mailer: iPad Mail (15F79)
>>
>> OK, I can see about adding some mobile MUA exclusions. Any FP headers
>> you can provide (directly) will be helpful. Go ahead and sanitize the
>> recipient info, I don't think that would be relevant to tuning this
>> one.
>
>
> I don't know that this is particularly specific to mobile, lots of
> people send emails with an empty subject.
>
> It sounds like the main cause would be a signature that contains the
> sender's name as the only thing in a line. That'll be why all the
> FPs mentioned above came from the same person.

Question: were those messages scored as spam?


Re: Understanding ruleQA results

Posted by RW <rw...@googlemail.com>.
On Tue, 14 Aug 2018 13:24:47 -0700 (PDT)
John Hardin wrote:

> On Tue, 14 Aug 2018, micah anderson wrote:
> 

> > I searched my pile of mail that I have from two ice ages ago, and I
> > did find 6 messages that were hits of this rule, one of them was
> > spam, five of them were this person trying to contact me.  
> 
> ...without a subject?
> 
> >> Do you happen to be seeing FPs with this rule?  
> >
> > Yes, it's why I am investigating it. I think it is common for people
> > who are sending mail from their mobiles, where they use it more
> > like a quick chat instead of a 'regular mail'....
> >
> > In fact, this person used:
> > X-Mailer: iPad Mail (15F79)  
> 
> OK, I can see about adding some mobile MUA exclusions. Any FP headers
> you can provide (directly) will be helpful. Go ahead and sanitize the 
> recipient info, I don't think that would be relevant to tuning this
> one.


I don't know that this is particularly specific to mobile, lots of
people send emails with an empty subject. 

It sounds like the main cause would be a signature that contains the
sender's name as the only thing in a line. That'll be why all the
FPs mentioned above came from the same person.

Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Aug 2018, micah anderson wrote:

> John Hardin <jh...@impsec.org> writes:
>
>> On Tue, 14 Aug 2018, micah anderson wrote:
>>
>>> John Hardin <jh...@impsec.org> writes:
>>>
>>>> On Tue, 14 Aug 2018, micah anderson wrote:
>>
>> OK, I can see about adding some mobile MUA exclusions. Any FP headers you
>> can provide (directly) will be helpful. Go ahead and sanitize the
>> recipient info, I don't think that would be relevant to tuning this one.
>
> I put 4 of the messages here:
>
> https://pastebin.com/YuPtBQXN
>
> thanks for your help!
>
> micah

Thanks.

Yesterday I added an FP avoidance check for DKIM based on the (very few)
ham hits that are in the masscheck corpus; that should be enough to
exclude these messages, as he's sending via Gmail, which adds a DKIM
signature.

I'm adding some xmailer subrules - the mobile MUA coverage is thin. I 
don't expect to see a lot of overlap, but I may add them anyway based on 
your report.


Re: Understanding ruleQA results

Posted by micah anderson <mi...@riseup.net>.
John Hardin <jh...@impsec.org> writes:

> On Tue, 14 Aug 2018, micah anderson wrote:
>
>> John Hardin <jh...@impsec.org> writes:
>>
>>> On Tue, 14 Aug 2018, micah anderson wrote:
>
> OK, I can see about adding some mobile MUA exclusions. Any FP headers you 
> can provide (directly) will be helpful. Go ahead and sanitize the 
> recipient info, I don't think that would be relevant to tuning this one.

I put 4 of the messages here:

https://pastebin.com/YuPtBQXN

thanks for your help!

micah

Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Aug 2018, micah anderson wrote:

> John Hardin <jh...@impsec.org> writes:
>
>> On Tue, 14 Aug 2018, micah anderson wrote:
>>
>>> but how can I tell how many messages are part of the corpus?
>>
>> As RW said, hover over the percentages.
>
> Thanks.
>
>>> Also, the percentages seem very low: 1.5192% Spam, and .0005%
>>> Ham... 1.5% seems low to me to be adding 3.5 score to this rule, but
>>> what do I know... which is why I'm asking.
>>
>> It's not so much the raw amount of spam it hits, it's that it hits spam
>> that few other rules hit, or that it hits spam that other rules hit but
>> that doesn't score high enough with those other rules.
>>
>> You also want to look at the score-map section when evaluating a rule.
>
> Is there an explanation of the score-map section somewhere?
>
> For this one it says:
>
>  scoremap  ham:  0  33.33%    1 *************
>  scoremap  ham:  1  66.67%    2 **************************
>  scoremap spam:  1   0.08%   15
>  scoremap spam:  3   0.61%  121
>  scoremap spam:  4  90.24% 17791 ************************************
>  scoremap spam:  5   2.69%  531 *
>  scoremap spam:  6   4.54%  896 *
>  scoremap spam:  7   1.10%  217
>  scoremap spam:  8   0.26%   52
>  scoremap spam:  9   0.40%   79
>  scoremap spam: 10   0.01%    2
>  scoremap spam: 11   0.05%    9
>  scoremap spam: 14   0.01%    2
>
> What are these columns and how can I interpret it?

ham/spam: what it hit

The number after ham/spam is the points the message earned. Unfortunately
I don't know offhand whether or not that includes *this rule*. I'd have to
go digging in the code to determine that. I suspect it's the total score
including this rule. I also don't recall offhand which of the four
possible scoresets the score here is based on. It may be the non-net
scoreset for the regular weekly runs and the net scoreset for the net run
on the weekend, but I don't know whether it's the bayes or non-bayes
variant.
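
For reference, the four scoresets are numbered by whether network tests and Bayes are enabled, as described for the "score" setting in the Mail::SpamAssassin::Conf documentation. A minimal sketch of that numbering (the function name is hypothetical; this is not SpamAssassin code):

```python
def scoreset_index(net: bool, bayes: bool) -> int:
    """Map the two switches to SpamAssassin's scoreset number (0-3)."""
    # set 0: neither, set 1: net only, set 2: Bayes only, set 3: both
    return (2 if bayes else 0) + (1 if net else 0)

for bayes in (False, True):
    for net in (False, True):
        print("bayes:", bayes, "net:", net, "-> scoreset", scoreset_index(net, bayes))
```

This does not settle which of the four the score-map uses; it only names the candidates.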

The percentage is that score's share of the rule's total ham or spam
hits; the asterisks are a visual representation of that.

The final number is the total number of messages that hit at that score.

For example, this rule hit 17791 spams scored at 4 points, which was 
90.24% of the total spam hits.
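
As a cross-check of that reading, here is a small sketch that parses the lines quoted above and recomputes the percentage column from the counts. The column layout (class, score, percent, count) is taken from the explanation above; the parsing code and the name parse_scoremap are hypothetical and not part of ruleqa.

```python
import re

SCOREMAP = """\
scoremap  ham:  0  33.33%    1 *************
scoremap  ham:  1  66.67%    2 **************************
scoremap spam:  1   0.08%   15
scoremap spam:  3   0.61%  121
scoremap spam:  4  90.24% 17791 ************************************
scoremap spam:  5   2.69%  531 *
scoremap spam:  6   4.54%  896 *
scoremap spam:  7   1.10%  217
scoremap spam:  8   0.26%   52
scoremap spam:  9   0.40%   79
scoremap spam: 10   0.01%    2
scoremap spam: 11   0.05%    9
scoremap spam: 14   0.01%    2
"""

LINE = re.compile(r"scoremap\s+(ham|spam):\s+(\d+)\s+([\d.]+)%\s+(\d+)")

def parse_scoremap(text):
    """Return {class: {score: count}} from score-map lines."""
    counts = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            kind, score, _pct, count = m.groups()
            counts.setdefault(kind, {})[int(score)] = int(count)
    return counts

counts = parse_scoremap(SCOREMAP)
spam_total = sum(counts["spam"].values())
share_at_4 = 100 * counts["spam"][4] / spam_total
print(spam_total, f"{share_at_4:.2f}%")
```

The recomputed share for score 4 matches the 90.24% shown in the listing, which supports reading the percentage as a share of the rule's spam hits rather than of the whole corpus.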

Based on the above, this rule is helping detect low-scoring spams, but a 
little more is still needed to push them over the threshold. *Potentially* 
that would be increasing the score of this rule, but it's already at ~3.5 
points and bumping it any higher is edging into "poison pill" territory, 
which is generally a bad idea (except for rules that are very high S/O on 
malware, in which case yes, poison away!).

>> It's not so much the raw amount of spam it hits, it's that it hits spam
>> that few other rules hit, or that it hits spam that other rules hit but
>> that doesn't score high enough with those other rules.
>
> I searched my pile of mail that I have from two ice ages ago, and I did
> find 6 messages that were hits of this rule, one of them was spam, five
> of them were this person trying to contact me.

...without a subject?

>> Do you happen to be seeing FPs with this rule?
>
> Yes, it's why I am investigating it. I think it is common for people who
> are sending mail from their mobiles, where they use it more like a quick
> chat instead of a 'regular mail'....
>
> In fact, this person used:
> X-Mailer: iPad Mail (15F79)

OK, I can see about adding some mobile MUA exclusions. Any FP headers you 
can provide (directly) will be helpful. Go ahead and sanitize the 
recipient info, I don't think that would be relevant to tuning this one.



Re: Understanding ruleQA results

Posted by micah anderson <mi...@riseup.net>.
John Hardin <jh...@impsec.org> writes:

> On Tue, 14 Aug 2018, micah anderson wrote:
>
>> but how can I tell how many messages are part of the corpus?
>
> As RW said, hover over the percentages.

Thanks.

>> Also, the percentages seem very low: 1.5192% Spam, and .0005%
>> Ham... 1.5% seems low to me to be adding 3.5 score to this rule, but
>> what do I know... which is why I'm asking.
>
> It's not so much the raw amount of spam it hits, it's that it hits spam 
> that few other rules hit, or that it hits spam that other rules hit but 
> that doesn't score high enough with those other rules.
>
> You also want to look at the score-map section when evaluating a rule.

Is there an explanation of the score-map section somewhere?

For this one it says:

  scoremap  ham:  0  33.33%    1 *************
  scoremap  ham:  1  66.67%    2 **************************
  scoremap spam:  1   0.08%   15 
  scoremap spam:  3   0.61%  121 
  scoremap spam:  4  90.24% 17791 ************************************
  scoremap spam:  5   2.69%  531 *
  scoremap spam:  6   4.54%  896 *
  scoremap spam:  7   1.10%  217 
  scoremap spam:  8   0.26%   52 
  scoremap spam:  9   0.40%   79 
  scoremap spam: 10   0.01%    2 
  scoremap spam: 11   0.05%    9 
  scoremap spam: 14   0.01%    2 

What are these columns and how can I interpret it?

> It's not so much the raw amount of spam it hits, it's that it hits spam 
> that few other rules hit, or that it hits spam that other rules hit but 
> that doesn't score high enough with those other rules.

I searched my pile of mail that I have from two ice ages ago, and I did
find 6 messages that hit this rule: one of them was spam, and five
of them were this person trying to contact me.

> Do you happen to be seeing FPs with this rule?

Yes, it's why I am investigating it. I think it is common for people who
are sending mail from their mobiles, where they use it more like a quick
chat instead of a 'regular mail'....

In fact, this person used:
X-Mailer: iPad Mail (15F79)


-- 
        micah

Re: Understanding ruleQA results

Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Aug 2018, micah anderson wrote:

> I'm trying to understand the ruleQA results because I'm trying to track
> down how common the rule FRNAME_IN_MSG_NO_SUBJ is spammy.
>
> I load the latest rules: http://ruleqa.spamassassin.org/20180813-r1837926-n/FRNAME_IN_MSG_NO_SUBJ/detail?s_corpus=1&s_g_over_time=1#overtime

That run only has three masscheck corpora. You might want to look earlier 
or later to a run that has more, for example:

http://ruleqa.spamassassin.org/20180814-r1837997-n/FRNAME_IN_MSG_NO_SUBJ/detail

> and I see the S/O value is 1.0, which is a rule that hits only on spam

Or close enough that rounding hides the ham hits.
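
To illustrate the rounding point, here is a minimal sketch of the S/O arithmetic. It assumes S/O is computed as spam% / (spam% + ham%) from the two hit percentages quoted in this thread; the function name s_over_o is made up for this example and is not part of ruleqa.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's (corpus-normalized) hits that land on spam."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.5  # assumption: treat a rule with no hits as uninformative
    return spam_pct / total

# Percentages quoted for FRNAME_IN_MSG_NO_SUBJ: 1.5192% spam, 0.0005% ham.
so = s_over_o(1.5192, 0.0005)
print(f"unrounded: {so:.6f}  to three places: {so:.3f}")
```

With those inputs the ratio comes out just under 1.0 and displays as 1.000 once rounded, even though the ham hit rate is not zero.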

> (a rule that only hits on ham is 0.0, a rule that doesn't tell you anything is
> 0.5)...

> but how can I tell how many messages are part of the corpus?

As RW said, hover over the percentages.

> Also, the percentages seem very low: 1.5192% Spam, and .0005%
> Ham... 1.5% seems low to me to be adding 3.5 score to this rule, but
> what do I know... which is why I'm asking.

It's not so much the raw amount of spam it hits, it's that it hits spam 
that few other rules hit, or that it hits spam that other rules hit but 
that doesn't score high enough with those other rules.

You also want to look at the score-map section when evaluating a rule.

I don't care when a rule hits a lot of spam scoring 20+ points. I care a 
lot if it hits spams that score 1-4 points.
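
As a rough sketch of that priority, the spam counts from the score-map posted elsewhere in this thread for this rule can be split at SpamAssassin's default required_score of 5.0. This assumes a score-map entry N covers N.000 to N.999, which has not been confirmed against the ruleqa code.

```python
# Spam hits per score, copied from the score-map quoted in this thread.
spam_hits = {1: 15, 3: 121, 4: 17791, 5: 531, 6: 896, 7: 217,
             8: 52, 9: 79, 10: 2, 11: 9, 14: 2}

THRESHOLD = 5  # default required_score; assumes entry N means N.000-N.999

total = sum(spam_hits.values())
below = sum(n for score, n in spam_hits.items() if score < THRESHOLD)
print(f"{below} of {total} spam hits ({100 * below / total:.2f}%) scored under {THRESHOLD}")
```

On that reading, roughly nine in ten of the spam messages this rule hits were still scoring under the threshold, which is the low-scoring range where an extra rule matters most.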


Do you happen to be seeing FPs with this rule?


Re: Understanding ruleQA results

Posted by RW <rw...@googlemail.com>.
On Tue, 14 Aug 2018 11:38:27 -0400
micah anderson wrote:

> Hi,
> 
> I'm trying to understand the ruleQA results because I'm trying to
> track down how common the rule FRNAME_IN_MSG_NO_SUBJ is spammy.
> 
> I load the latest rules:
> http://ruleqa.spamassassin.org/20180813-r1837926-n/FRNAME_IN_MSG_NO_SUBJ/detail?s_corpus=1&s_g_over_time=1#overtime
> 
> and I see the S/O value is 1.0, which is a rule that hits only on spam
> (a rule that only hits on ham is 0.0, a rule that doesn't tell you anything is
> 0.5)... but how can I tell how many messages are part of the corpus?


'mouseover' the percentages

 
> Also, the percentages seem very low: 1.5192% Spam, and .0005%
> Ham... 1.5% seems low to me to be adding 3.5 score to this rule,

 

The only reason that that might be a problem is if all the hits
occurred in a single short period, which would suggest it's a property
of a single spam run. Other than that the scores come from an
optimization process. You can see why it gets a large score just by
looking at the score-map section.