Posted to users@spamassassin.apache.org by da...@chaosreigns.com on 2012/10/26 18:18:49 UTC

Masscheck Re: Question about rule: 2.0 DEAR_SOMETHING BODY: Contains 'Dear (something)'

On 10/26, Alexandre Boyer wrote:
> Well, discouraged was implicit (as is the fact that every admin is

I don't think there's anything implicit about it being discouraged to use a
threshold below 5.  There are lots of local changes which are far less
likely to cause problems, and encouraged.

> The SA rule scores are computed based on the mass-checks, from the
> project and, to some extent, from contributors. A good question is: how
> many contributors really give feedback on the mass-checks?

This is public information, although not very explicit.
On http://ruleqa.spamassassin.org/ look in the green box, it lists all the
corpora included:

  axb-coi-bulk
  axb-fraud
  axb-generic
  axb-ham-misc
  axb-sa-users
  axb-woas
  bb-guenther_fraud
  bb-jhardin
  bb-jhardin_fraud
  bb-jm
  bb-kmcgrail
  bb-zmi
  bpoliakoff
  danmcdonald
  darxus
  grenier
  jarif
  kpg-gah
  mas
  zmi

The ones starting with "bb-" are uploaded emails: instead of the contributor
running masscheck locally, it's run centrally.  Other than that, each prefix
is a different contributor.  So:

axb, guenther, jhardin, jm, kmcgrail, zmi, bpoliakoff, danmcdonald,
darxus, grenier, jarif, kpg, mas.

13 masscheck contributors.  We'd probably benefit a lot by significantly
increasing that, which is why I mention it somewhat often.
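
As a rough sketch of that counting in Python: the helper name is made up, and
treating both "-" and "_" as the separator before a corpus suffix is my
assumption from the names in the list, not something the rule QA site
documents.

```python
# Corpus names copied from the list above.
CORPORA = [
    "axb-coi-bulk", "axb-fraud", "axb-generic", "axb-ham-misc",
    "axb-sa-users", "axb-woas", "bb-guenther_fraud", "bb-jhardin",
    "bb-jhardin_fraud", "bb-jm", "bb-kmcgrail", "bb-zmi", "bpoliakoff",
    "danmcdonald", "darxus", "grenier", "jarif", "kpg-gah", "mas", "zmi",
]


def contributor(corpus: str) -> str:
    """Return the contributor part of a corpus name (hypothetical helper).

    Assumed convention: an optional "bb-" marker, then the contributor,
    then an optional suffix introduced by "-" or "_".
    """
    name = corpus[3:] if corpus.startswith("bb-") else corpus
    for sep in ("-", "_"):
        name = name.split(sep, 1)[0]
    return name


# A set removes contributors that appear under several corpora.
contributors = sorted({contributor(c) for c in CORPORA})
print(len(contributors), ", ".join(contributors))
```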

> This is something I do not know, but the fewer they are, the greater the
> bias is. Bias in spam and ham samples. Emails reaching my servers are
> different from yours and from every other SA user's.

Absolutely.

> Unless everybody on earth runs a nightly mass-check and reports the results
> to the SA project so it can compute a "world wide" scoring, there is a bias.
> At least this is my understanding; maybe I'm wrong, please correct me if so.

No, you're totally right.  We do what we can with what we have, and I think
we do pretty darn good.  But we could do better with more data.  

> For example, I'm in the process of learning to use mass-check to
> contribute back to SA (which implies a lot of hard work, simply to build
> and maintain valid ham/spam corpora, use mass-check, then hit-freq, then
> fp-fn-stat; I'm not even close to understanding how to compute a re-score.

I don't know what fp-fn-stat is.  You don't need to compute a re-score -
that's part of what is done with your masscheck data after you upload it.

There's a relatively recently created mailing list specifically for helping
people with this stuff, to which I believe you automatically get subscribed
when you get a masscheck account:
http://wiki.apache.org/spamassassin/MailingLists#RuleQA

If you're having difficulty with it, the docs probably need improvement, so
do let us know.


Your mention of fp-fn-stat makes me think you may have veered a little too
far from https://wiki.apache.org/spamassassin/NightlyMassCheck

> with this, I'm not sure my contribution would be sufficient to bring SA
> scores closer to my email traffic reality.

I think it would.  For example, I'm sure, from what you've posted, that you
have enough examples of ham hitting DEAR_SOMETHING that its score
would drop significantly.
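
To illustrate why more ham hits would pull a score down, here is a toy
calculation in Python.  The counts are invented, and the statistic (the share
of a rule's hits that come from spam, after normalizing by corpus size) is
only a stand-in I'm assuming for illustration; it is not the actual rescoring
that is run on uploaded masscheck data.

```python
def spam_share(spam_hits: int, spam_total: int,
               ham_hits: int, ham_total: int) -> float:
    """Fraction of a rule's size-normalized hits that come from spam."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.0  # the rule never hit anything
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical corpus where the rule almost never hits ham.
before = spam_share(spam_hits=2000, spam_total=100_000,
                    ham_hits=50, ham_total=50_000)

# Same corpus plus a hypothetical contributor whose ham often hits the rule.
after = spam_share(spam_hits=2000, spam_total=100_000,
                   ham_hits=50 + 600, ham_total=50_000 + 10_000)

print(f"before: {before:.3f}  after: {after:.3f}")
```

The more of a rule's hits land on ham, the weaker the evidence that a hit
means spam, which is the direction in which a score would be expected to move.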

> Do you have any stats on how many contributors are giving feedback
> on the masscheck, and on their geographical location? I'm just asking
> because I was not able to find this kind of information anywhere.

I believe they're almost all in the US, primarily English speakers.  That's
bad.

-- 
"You only truly own what you can carry at a dead run."
- 14th & 15th century Landsknechts
http://www.ChaosReigns.com

Re: Masscheck Re: Question about rule: 2.0 DEAR_SOMETHING BODY: Contains 'Dear (something)'

Posted by Alexandre Boyer <bi...@gmail.com>.
Alex, from osmose.
Bow before me, for I am root.

On 12-10-26 12:18 PM, darxus@chaosreigns.com wrote:
> On 10/26, Alexandre Boyer wrote:
>> Well, discouraged was implicit (as is the fact that every admin is
> I don't think there's anything implicit about it being discouraged to use a
> threshold below 5.  There are lots of local changes which are far less
> likely to cause problems, and encouraged.
>
>> The SA rule scores are computed based on the mass-checks, from the
>> project and, to some extent, from contributors. A good question is: how
>> many contributors really give feedback on the mass-checks?
> This is public information, although not very explicit.
> On http://ruleqa.spamassassin.org/ look in the green box, it lists all the
> corpora included:
>
>   axb-coi-bulk
>   axb-fraud
>   axb-generic
>   axb-ham-misc
>   axb-sa-users
>   axb-woas
>   bb-guenther_fraud
>   bb-jhardin
>   bb-jhardin_fraud
>   bb-jm
>   bb-kmcgrail
>   bb-zmi
>   bpoliakoff
>   danmcdonald
>   darxus
>   grenier
>   jarif
>   kpg-gah
>   mas
>   zmi
>
> The ones starting with "bb-" are uploaded emails: instead of the contributor
> running masscheck locally, it's run centrally.  Other than that, each prefix
> is a different contributor.  So:
>
> axb, guenther, jhardin, jm, kmcgrail, zmi, bpoliakoff, danmcdonald,
> darxus, grenier, jarif, kpg, mas.
>
> 13 masscheck contributors.  We'd probably benefit a lot by significantly
> increasing that, which is why I mention it somewhat often.
>
>> This is something I do not know, but the fewer they are, the greater the
>> bias is. Bias in spam and ham samples. Emails reaching my servers are
>> different from yours and from every other SA user's.
> Absolutely.
>
>> Unless everybody on earth runs a nightly mass-check and reports the results
>> to the SA project so it can compute a "world wide" scoring, there is a bias.
>> At least this is my understanding; maybe I'm wrong, please correct me if so.
> No, you're totally right.  We do what we can with what we have, and I think
> we do pretty darn good.  But we could do better with more data.  
>
>> For example, I'm in the process of learning to use mass-check to
>> contribute back to SA (which implies a lot of hard work, simply to build
>> and maintain valid ham/spam corpora, use mass-check, then hit-freq, then
>> fp-fn-stat; I'm not even close to understanding how to compute a re-score.
> I don't know what fp-fn-stat is.  You don't need to compute a re-score -
> that's part of what is done with your masscheck data after you upload it.

I replied to AXB about this; it's my problem, nothing to do with SA
itself ;)

fp-fn-statistics is a script in sa-trunk/masses that tells you how
many FPs and FNs you get, given your mass-check data and
hit-frequencies. You can then choose which score-set and which
threshold to use.

Very useful indeed, especially for those who have tons of personal rules.
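
A minimal sketch in Python of the kind of counting I mean. This is not the
actual fp-fn-statistics script: the record layout, the function name, the
LOCAL_RULE_* names and their scores are all made up for illustration. Only
DEAR_SOMETHING and its 2.0 come from this thread.

```python
from typing import Dict, Iterable, List, Tuple

# One record per message: (is_spam, names of the rules that hit).
Record = Tuple[bool, List[str]]


def fp_fn_counts(records: Iterable[Record],
                 scores: Dict[str, float],
                 threshold: float = 5.0) -> Tuple[int, int]:
    """Count false positives and false negatives at a given threshold."""
    fp = fn = 0
    for is_spam, hits in records:
        total = sum(scores.get(rule, 0.0) for rule in hits)
        flagged = total >= threshold
        if flagged and not is_spam:
            fp += 1  # ham tagged as spam
        elif is_spam and not flagged:
            fn += 1  # spam that got through
    return fp, fn


# Hypothetical score set and mass-check results.
scores = {"DEAR_SOMETHING": 2.0, "LOCAL_RULE_A": 2.5, "LOCAL_RULE_B": 1.5}
records = [
    (False, ["DEAR_SOMETHING", "LOCAL_RULE_A"]),                 # ham, 4.5 points
    (True, ["DEAR_SOMETHING", "LOCAL_RULE_A", "LOCAL_RULE_B"]),  # spam, 6.0 points
    (True, ["LOCAL_RULE_B"]),                                    # spam, 1.5 points
]

for threshold in (5.0, 4.0):
    print(threshold, fp_fn_counts(records, scores, threshold))
```

With this toy data, lowering the threshold from 5 to 4 turns the 4.5-point ham
into a false positive without catching any more spam.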

>
> There's a relatively recently created mailing list specifically for helping
> people with this stuff, to which I believe you automatically get subscribed
> when you get a masscheck account:
> http://wiki.apache.org/spamassassin/MailingLists#RuleQA

I will subscribe sooner or later. It all depends on my other problems,
you know, keeping dozens of servers in operation. Not a big deal, but
time-consuming.

>
> If you're having difficulty with it, the docs probably need improvement, so
> do let us know.

Up to the mass-check, I've got nothing to complain about. The doc is
pretty clear (though an introduction to SA in general and mass-check in
particular would not hurt). On the other hand, documentation is missing
for the next steps (when one wants or needs to use the other scripts in
sa-trunk/masses), forcing you to read the code and guess at each script's
purpose and the order in which they should be used.

>
>
> Your mention of fp-fn-stat makes me think you may have veered a little too
> far from https://wiki.apache.org/spamassassin/NightlyMassCheck

I will certainly stick to this for my (later) contribution to SA.

>
>> with this, I'm not sure my contribution would be sufficient to bring SA
>> scores closer to my email traffic reality.
> I think it would.  For example, I'm sure, from what you've posted, that you
> have enough examples of ham hitting DEAR_SOMETHING that its score
> would drop significantly.
>
>> Do you have any stats on how many contributors are giving feedback
>> on the masscheck, and on their geographical location? I'm just asking
>> because I was not able to find this kind of information anywhere.
> I believe they're almost all in the US, primarily English speakers.  That's
> bad.
>
Sure it is: SA does pretty damn well, but for those of us who speak French
or Spanish, the scores of a lot of FUZZY_something rules (you know, those
with replace tags) are totally inappropriate. FUZZY_AMBIEN, for example, is
a pain in the blip. It matches "combien" or "tan bien", very common and not
at all spammy terms in French and Spanish (I guess that if you run a small
instance of SA you may compensate with Bayes learning).

Sure, a big corpus in those languages (like the one I'm building) could
lower their scores, but will it be enough? Time will tell, I guess ;-)

After reading your answer to my later post: 150,000 spams are easy to get.
I think a week could be enough, while 2-3 weeks would give greater
variety. As for ham messages, it takes longer because I'm only interested in
my false positives, which are somewhat rare. The rescoring step seemed
important to me because of my threshold and the quantity of personal
rules, but maybe I have to think twice about my strategy. Six years for
ham TTL? This is not crazy, this is absurd. But given the difficulty of
collecting ham messages, I certainly understand it very well.

Thank you for sharing your know-how and opinions, Darxus; it's very
helpful and much appreciated!