Posted to dev@spamassassin.apache.org by John Hardin <jh...@impsec.org> on 2014/04/05 18:42:21 UTC

Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Could someone who understands the scoring logic used by the perceptron or 
GA please comment on why this rule (and others like it) is only being 
scored at 0.01?

http://ruleqa.spamassassin.org/20140404-r1584563-n/T_DX_TEXT_02/detail

I would think that a rule which hits nothing but spam (S/O 1.00), and 
whose hits are 70% on spam scoring below 5 points, would be scored at 2 or 
3 points regardless of how many actual hits it gets...
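
As a rough illustration of the profile described above, here is a minimal Python sketch. It assumes S/O is the spam hit rate divided by the sum of the spam and ham hit rates. The hit count and the ham corpus size are made up for illustration, and the sketch says nothing about how the perceptron or GA actually weights rules.

```python
def so_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Return S/O: spam hit rate over the sum of spam and ham hit rates."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None  # the rule hit nothing, so S/O is undefined
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical rule: 120 spam hits and no ham hits.
SPAM_TOTAL = 510_000  # approximate spam corpus size quoted in this thread
HAM_TOTAL = 100_000   # made-up ham corpus size, for illustration only

ratio = so_ratio(120, SPAM_TOTAL, 0, HAM_TOTAL)
coverage = 120 / SPAM_TOTAL  # fraction of the spam corpus the rule hits
print(f"S/O = {ratio:.2f}, spam coverage = {coverage:.4%}")
```

The point of the sketch is only that S/O and coverage are separate numbers: a rule can have a perfect S/O while hitting a tiny fraction of the corpus.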

Does it just take some time for the perceptron to get "primed" and start 
scoring rules once the corpora are of sufficient size? Because there are 
older rules with similar profiles that are being scored.

I've observed that a lot of high-S/O rules that hit well on low-scoring 
spam but that don't necessarily hit a lot of spam are assigned very low 
scores, such that they don't appear to help much in pushing those 
low-scoring spams towards the threshold. Many aren't being scored at all 
and thus aren't being published.

I haven't started digging into the scoring code yet; is there some bias 
based on the number of overall hits a rule gets, or the highest score on 
messages the rule hits, that would tend to impose a seemingly unreasonably 
low limit on the generated score?

I'd rather not have to resort to hitting the masscheck system over the 
head with the "tflags publish" cluebat, but I will if it keeps ignoring 
these rules.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The difference is that Unix has had thirty years of technical
   types demanding basic functionality of it. And the Macintosh has
   had fifteen years of interface fascist users shaping its progress.
   Windows has the hairpin turns of the Microsoft marketing machine
   and that's all.                                    -- Red Drag Diva
-----------------------------------------------------------------------
  8 days until Thomas Jefferson's 271st Birthday

Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by Axb <ax...@gmail.com>.
On 04/05/2014 06:59 PM, Axb wrote:
> If Darxus sees so much of this type, why isn't he running a masschecker?

Oops, sorry. I hadn't seen that he is indeed participating.




Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 4/7/2014 11:03 AM, John Hardin wrote:
> On Mon, 7 Apr 2014, Kevin A. McGrail wrote:
>
>> On 4/5/2014 12:59 PM, Axb wrote:
>>>  On 04/05/2014 06:42 PM, John Hardin wrote:
>>> >  I'd rather not have to resort to hitting the masscheck system over the
>>> >  head with the "tflags publish" cluebat, but I will if it keeps ignoring
>>> >  these rules.
>>>
>>>  this would be very unwise and would create rule bloat as obviously the
>>>  corpus isn't seeing much spam with whatever pattern you'd want to publish.
>>
>> According to the wiki, the tflags publish is required to publish
>> rules: rules without an explicit "tflags publish" line are never
>> published
>>
>> http://wiki.apache.org/spamassassin/SaUpdateBackend
>
> Unless "tflags publish" is the default, that doesn't seem to be the 
> current behavior. Many of my rules do not have an explicit "tflags 
> publish" on them yet they are being published - for example, 
> TO_NO_BRKTS_MSFT
>
Can't disagree but pointing out that I use tflags publish because 
according to the docs you are supposed to...

Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 4/7/2014 11:26 AM, John Hardin wrote:
> On Mon, 7 Apr 2014, Axb wrote:
>
>> On 04/07/2014 05:03 PM, John Hardin wrote:
>>>  On Mon, 7 Apr 2014, Kevin A. McGrail wrote:
>>>
>>> >  On 4/5/2014 12:59 PM, Axb wrote:
>>> > >   On 04/05/2014 06:42 PM, John Hardin wrote:
>>> > > >   I'd rather not have to resort to hitting the masscheck system over
>>> > > >   the head with the "tflags publish" cluebat, but I will if it keeps
>>> > > >   ignoring these rules.
>>> > >
>>> > >   this would be very unwise and would create rule bloat as obviously
>>> > >   the corpus isn't seeing much spam with whatever pattern you'd want
>>> > >   to publish.
>>> >
>>> >  According to the wiki, the tflags publish is required to publish
>>> >  rules: rules without an explicit "tflags publish" line are never
>>> >  published
>>> >
>>> >  http://wiki.apache.org/spamassassin/SaUpdateBackend
>>>
>>>  Unless "tflags publish" is the default, that doesn't seem to be the
>>>  current behavior. Many of my rules do not have an explicit "tflags
>>>  publish" on them yet they are being published - for example,
>>>  TO_NO_BRKTS_MSFT
>>>
>>
>> iirc,  /trunk/rulesrc/10_force_active.cf was used for that
>>
>> #
>> #  Force some sandbox rules to be active, since they have scores assigned
>> #  by the GA/Perceptron evolver.  If you want to remove a rule from
>> #  this list, be sure to remove it's 'score' line in rules/50_scores.cf
>> #  too.
>> #
>
> TO_NO_BRKTS_MSFT does not have a "publish" anywhere in SVN.

Agreed.  I'll take a look and see why.

Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 7 Apr 2014, Axb wrote:

> On 04/07/2014 05:03 PM, John Hardin wrote:
>>  On Mon, 7 Apr 2014, Kevin A. McGrail wrote:
>> 
>> >  On 4/5/2014 12:59 PM, Axb wrote:
>> > >   On 04/05/2014 06:42 PM, John Hardin wrote:
>> > > >   I'd rather not have to resort to hitting the masscheck system over
>> > > >   the head with the "tflags publish" cluebat, but I will if it keeps
>> > > >   ignoring these rules.
>> > > 
>> > >   this would be very unwise and would create rule bloat as obviously
>> > >   the corpus isn't seeing much spam with whatever pattern you'd want
>> > >   to publish.
>> > 
>> >  According to the wiki, the tflags publish is required to publish
>> >  rules: rules without an explicit "tflags publish" line are never
>> >  published
>> > 
>> >  http://wiki.apache.org/spamassassin/SaUpdateBackend
>>
>>  Unless "tflags publish" is the default, that doesn't seem to be the
>>  current behavior. Many of my rules do not have an explicit "tflags
>>  publish" on them yet they are being published - for example,
>>  TO_NO_BRKTS_MSFT
>> 
>
> iirc,  /trunk/rulesrc/10_force_active.cf was used for that
>
> # 
> #  Force some sandbox rules to be active, since they have scores assigned
> #  by the GA/Perceptron evolver.  If you want to remove a rule from
> #  this list, be sure to remove it's 'score' line in rules/50_scores.cf
> #  too.
> #

TO_NO_BRKTS_MSFT does not have a "publish" anywhere in SVN.



Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by Axb <ax...@gmail.com>.
On 04/07/2014 05:03 PM, John Hardin wrote:
> On Mon, 7 Apr 2014, Kevin A. McGrail wrote:
>
>> On 4/5/2014 12:59 PM, Axb wrote:
>>>  On 04/05/2014 06:42 PM, John Hardin wrote:
>>> >  I'd rather not have to resort to hitting the masscheck system over the
>>> >  head with the "tflags publish" cluebat, but I will if it keeps ignoring
>>> >  these rules.
>>>
>>>  this would be very unwise and would create rule bloat as obviously the
>>>  corpus isn't seeing much spam with whatever pattern you'd want to publish.
>>
>> According to the wiki, the tflags publish is required to publish
>> rules: rules without an explicit "tflags publish" line are never
>> published
>>
>> http://wiki.apache.org/spamassassin/SaUpdateBackend
>
> Unless "tflags publish" is the default, that doesn't seem to be the
> current behavior. Many of my rules do not have an explicit "tflags
> publish" on them yet they are being published - for example,
> TO_NO_BRKTS_MSFT
>

iirc,  /trunk/rulesrc/10_force_active.cf was used for that

#
# Force some sandbox rules to be active, since they have scores assigned
# by the GA/Perceptron evolver.  If you want to remove a rule from
# this list, be sure to remove it's 'score' line in rules/50_scores.cf
# too.
#

Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 7 Apr 2014, Kevin A. McGrail wrote:

> On 4/5/2014 12:59 PM, Axb wrote:
>>  On 04/05/2014 06:42 PM, John Hardin wrote:
>> >  I'd rather not have to resort to hitting the masscheck system over the
>> >  head with the "tflags publish" cluebat, but I will if it keeps ignoring
>> >  these rules.
>>
>>  this would be very unwise and would create rule bloat as obviously the
>>  corpus isn't seeing much spam with whatever pattern you'd want to publish.
>
> According to the wiki, the tflags publish is required to publish rules: rules 
> without an explicit "tflags publish" line are never published
>
> http://wiki.apache.org/spamassassin/SaUpdateBackend

Unless "tflags publish" is the default, that doesn't seem to be the 
current behavior. Many of my rules do not have an explicit "tflags 
publish" on them yet they are being published - for example, 
TO_NO_BRKTS_MSFT


Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 4/5/2014 12:59 PM, Axb wrote:
> On 04/05/2014 06:42 PM, John Hardin wrote:
>> I'd rather not have to resort to hitting the masscheck system over the
>> head with the "tflags publish" cluebat, but I will if it keeps ignoring
>> these rules.
>
> this would be very unwise and would create rule bloat as obviously the
> corpus isn't seeing much spam with whatever pattern you'd want to
> publish.
>
> If the corpus is pathetically small then the results reflect this or
> rule X only applies to very specific traffic which is not
> representative.
>
> The idea of the GA is to conservatively publish rules which are useful
> on a global basis. Bypassing this mechanism seems to defeat GA and we
> might as well stop using it.

According to the wiki, the tflags publish is required to publish rules:
rules without an explicit "tflags publish" line are never published

http://wiki.apache.org/spamassassin/SaUpdateBackend

Regards,
KAM

Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by John Hardin <jh...@impsec.org>.
On Sat, 5 Apr 2014, John Hardin wrote:

> On Sat, 5 Apr 2014, Axb wrote:
>
>>  On 04/05/2014 07:33 PM, John Hardin wrote:
>> 
>> >   The masscheck spam corpus isn't pathetically small, but at the moment
>> >   it's *strongly* biased towards the traffic *you* are seeing. Your spam
>> >   is 490k+ of the 510k total corpus.
>>
>>  Should I feel guilty for only masschecking the last 21 days?
>
> No, certainly not. But I did want to point out that the corpus is biased at 
> the moment.

Let me amend that: I don't have any idea how diverse your corpora feeds 
are, so it's entirely possible that your providing the bulk of masscheck 
spam recently isn't actually causing any bias in the results.


Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by John Hardin <jh...@impsec.org>.
On Sat, 5 Apr 2014, Axb wrote:

> On 04/05/2014 07:33 PM, John Hardin wrote:
>
>>  The masscheck spam corpus isn't pathetically small, but at the moment
>>  it's *strongly* biased towards the traffic *you* are seeing. Your spam
>>  is 490k+ of the 510k total corpus.
>
> Should I feel guilty for only masschecking the last 21 days?

No, certainly not. But I did want to point out that the corpus is biased 
at the moment.

>>  That was only an example. There are other rules for spam that I'm
>>  receiving, and I have some contact with a fairly large ISP that has been
>>  seeing similar traffic and reporting FNs to me, but the rules aren't
>>  doing well in masscheck.
>>  My personal message traffic is pretty small, and I don't know whether
>>  the ISP can devote any resources to performing masschecks.
>
> I've offered to run masschecks if ppl can't set up themselves but if I don't 
> get the data...

The problem there is privacy issues (for ham, at least).

> DNSWL was feeding me a spam trickle but that has disappeared as well.

Mark Perkel keeps offering his data... :)

>>  I've been considering publishing a separate rules feed for
>>  apparently-useful rules like this that masscheck doesn't seem to
>>  consider worthy, I may have to consider that idea more seriously.
>
> I'm personally in favour of ppl running separate repositories, a la SARE, but 
> that seems against the project's aims.
>
>>  For the moment, though, I think I will "tflags publish" a couple of my
>>  recent high-S/O rules. I wasn't proposing doing it en masse.
>
> imo, if we all start doing this for a couple of rules which perform well in a 
> small ecosystem the collective turns into "en masse".

True. :(

> but then... go for it

I can always turn them back off.


Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by Axb <ax...@gmail.com>.
On 04/05/2014 07:33 PM, John Hardin wrote:

> The masscheck spam corpus isn't pathetically small, but at the moment
> it's *strongly* biased towards the traffic *you* are seeing. Your spam
> is 490k+ of the 510k total corpus.

Should I feel guilty for only masschecking the last 21 days?


> That was only an example. There are other rules for spam that I'm
> receiving, and I have some contact with a fairly large ISP that has been
> seeing similar traffic and reporting FNs to me, but the rules aren't
> doing well in masscheck.
> My personal message traffic is pretty small, and I don't know whether
> the ISP can devote any resources to performing masschecks.

I've offered to run masschecks if ppl can't set up themselves but if I 
don't get the data...
DNSWL was feeding me a spam trickle but that has disappeared as well.

> I've been considering publishing a separate rules feed for
> apparently-useful rules like this that masscheck doesn't seem to
> consider worthy, I may have to consider that idea more seriously.

I'm personally in favour of ppl running separate repositories, a la 
SARE, but that seems against the project's aims.

> For the moment, though, I think I will "tflags publish" a couple of my
> recent high-S/O rules. I wasn't proposing doing it en masse.

imo, if we all start doing this for a couple of rules which perform well 
in a small ecosystem the collective turns into "en masse".

but then... go for it


Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by John Hardin <jh...@impsec.org>.
On Sat, 5 Apr 2014, Axb wrote:

> On 04/05/2014 06:42 PM, John Hardin wrote:
>>  I'd rather not have to resort to hitting the masscheck system over the
>>  head with the "tflags publish" cluebat, but I will if it keeps ignoring
>>  these rules.
>
> this would be very unwise and would create rule bloat as obviously the corpus 
> isn't seeing much spam with whatever pattern you'd want to publish.
>
> If the corpus is pathetically small then the results reflect this or 
> rule X only applies to very specific traffic which is not representative.

The masscheck spam corpus isn't pathetically small, but at the moment it's 
*strongly* biased towards the traffic *you* are seeing. Your spam is 490k+ 
of the 510k total corpus.
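
To put those figures in proportion, a small Python sketch of the arithmetic follows. The function name is hypothetical, and since the figure given is "490k+", the result is a lower bound on the share.

```python
def contributor_share(contributor_msgs, corpus_total):
    """Fraction of the corpus supplied by a single contributor."""
    return contributor_msgs / corpus_total


# Figures quoted above: 490k+ of roughly 510k spam messages.
share = contributor_share(490_000, 510_000)
print(f"single-contributor share: at least {share:.1%}")
```

On these numbers one contributor supplies roughly 96% of the spam corpus. Whether that translates into biased results depends on how diverse that contributor's feeds are.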

> The idea of the GA is to conservatively publish rules which are useful on a 
> global basis. Bypassing this mechanism seems to defeat GA and we might as 
> well stop using it.
>
> If Darxus sees so much of this type, why isn't he running a masschecker?

That was only an example. There are other rules for spam that I'm 
receiving, and I have some contact with a fairly large ISP that has been 
seeing similar traffic and reporting FNs to me, but the rules aren't doing 
well in masscheck.

My personal message traffic is pretty small, and I don't know whether the 
ISP can devote any resources to performing masschecks.

I've been considering publishing a separate rules feed for 
apparently-useful rules like this that masscheck doesn't seem to consider 
worthy, I may have to consider that idea more seriously.

For the moment, though, I think I will "tflags publish" a couple of my 
recent high-S/O rules. I wasn't proposing doing it en masse.


Re: Perceptron/GA logic w/r/t low-scoring high-S/O rules?

Posted by Axb <ax...@gmail.com>.
On 04/05/2014 06:42 PM, John Hardin wrote:
> I'd rather not have to resort to hitting the masscheck system over the
> head with the "tflags publish" cluebat, but I will if it keeps ignoring
> these rules.

this would be very unwise and would create rule bloat as obviously the 
corpus isn't seeing much spam with whatever pattern you'd want to publish.

If the corpus is pathetically small then the results reflect this or 
rule X only applies to very specific traffic which is not representative.

The idea of the GA is to conservatively publish rules which are useful 
on a global basis. Bypassing this mechanism seems to defeat GA and we 
might as well stop using it.

If Darxus sees so much of this type, why isn't he running a masschecker?