Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/02/04 04:15:44 UTC

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
> I am not sure who "smf" is, but are they aware that "test rules" without 
> "tflags nopublish" could possibly be auto-promoted with a score they 
> don't expect?  Scores in the sandbox below are ignored.  We don't want a
> repeat of the FSL_RU_URL issue.

Getting *cough* late here, and I'm terribly tired...

Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
someone else's sandbox. I guess the recent discussion about rules
possibly spreading unintentionally didn't reach everyone. Although it won't
right now, as we know -- but the corpora being under-limit is another
topic, unrelated to best practices. Adding tflags nopublish is not
intrusive anyway, and warranted in this case, I guess.

Definitely, good catch, Warren.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by "Warren Togami Jr." <wt...@gmail.com>.

On 2/9/2011 2:25 PM, Steve Freegard wrote:
>> What plugin do you have in mind? If it is your URI shortener plugin I
>> would have to strongly protest. Its current design is good in theory
>> but it will never survive in mass production. It has the capability of
>> making spamassassin far too slow while it is too easy to bypass.
>
> No - I wouldn't dream of putting the full short URI decoder into the
> mass-checks in its current form; but I would put a version that simply
> reported when short URIs were detected (BTW - the rules that do this in
> your sandbox need to backslash escape the dots) as it's more efficient
> to do this in a plug-in with an easily expandable list than with a URI
> rule and massive regexps IMO. We should have these sorts of
> 'instrumentation' rules in the corpus that tell us when a particular
> abuse vector is being more heavily exploited and allow them to be used
> in meta rules.
>
> Anyway - I was talking in more general terms; I have some ideas for a
> few other plugins.
>
> Regards,
> Steve.

Go ahead and delete my crappy huge shortener rule after you have 
something better to replace it.

http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/wtogami/README
Here is my open policy.  I invite other sandboxers to have their own.

Warren

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by Henrik Krohns <he...@hege.li>.

On Thu, Feb 10, 2011 at 12:25:17AM +0000, Steve Freegard wrote:
> 
> as it's more efficient to do this in a plug-in with an easily expandable list
> than with a URI rule and massive regexps IMO.

Nah. Given the tiny size of the list, it's far more efficient to use a
single large regexp than to go through the work of using a plugin that
iterates URIs internally and still has to match the stuff somehow (regexp
or hashes).

Of course you can use a simple script to generate the rule from a list using
Regexp::Assemble.
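
A minimal sketch of such a generator script, written in Python rather
than with Perl's Regexp::Assemble. It builds a plain alternation with no
trie optimization, and escapes the dots as suggested elsewhere in this
thread. The rule name SHORT_URL_SKETCH, the describe text and the domain
list are hypothetical, chosen only for illustration.

```python
import re

def build_pattern(domains):
    """Return one alternation regexp matching any of the given hostnames."""
    # re.escape backslash-escapes the dots, so "bit.ly" cannot match "bitxly".
    alternation = "|".join(re.escape(d) for d in sorted(set(domains)))
    # "/" is escaped because it is also the delimiter in the rule line below.
    return r"^https?:\/\/(?:www\.)?(?:" + alternation + r")\/"

def build_uri_rule(name, domains):
    """Return SpamAssassin-style rule lines for a hypothetical shortener rule."""
    return "\n".join([
        "uri       %s /%s/i" % (name, build_pattern(domains)),
        "describe  %s URI uses a known URL shortener" % name,
        # Keep the sandbox rule from being auto-promoted.
        "tflags    %s nopublish" % name,
    ])

print(build_uri_rule("SHORT_URL_SKETCH", ["tinyurl.com", "bit.ly", "goo.gl"]))
```

Regenerating the rule whenever the list changes keeps the list easy to
expand while the matching stays a single regexp.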

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by Steve Freegard <st...@stevefreegard.com>.

On 09/02/11 23:31, Warren Togami Jr. wrote:
> On 2/4/2011 7:40 AM, Steve Freegard wrote:
>> My apologies for the rookie mistake. As a recent addition to the team I
>> missed the original discussion on requiring 'tflags nopublish' for all
>> sandbox rules.
>>
>> Can someone point me at the Bug that discussed this and I'll update the
>> RuleSandboxes Wiki entry to reflect this.
>
> Currently this is only among our "tribal knowledge".  The Wiki needs a 
> bit of reorg to put this stuff into a logical place.
>

Ok - I'll update the RuleSandboxes entry.  Re-organizing the wiki is a 
totally separate issue.

>>
>> While I'm on the subject of Sandboxes; what are the rules for adding
>> DNSBLs to it for testing? Specifically I would like to add the fresh,
>> fresh10 and fresh15 lists from spameatingmonkey.com for testing as my
>> own local testing found them to be more effective and faster than the
>> day-old-bread list that is currently in the core ruleset.
>
> A critical flaw here is that the only way to truly measure its
> performance in masscheck is with "reuse", which means the ham/spam
> corpora must be tagged with this custom rule during delivery, and
> masscheck only records yes/no from the existing spamassassin headers.
> It is a waste of resources, and gives misleading results, to test
> FRESH-type rules on older mail.
>

Unfortunately I can see many issues with requiring that all corpora must 
contain SpamAssassin mark-up at the time it was received; this simply 
isn't practical for some people (myself included as my trap data comes 
from lots of different sources).

Would this not also be dangerous if the SA headers in the message were
generated by an older version of SA, or by one that had been modified,
was running with local-tests-only, or had been intentionally or
unintentionally poisoned to skew the results?

Blacklist effectiveness for things like day-old-bread and fresh5/10/15 
can be checked by looking at the message age versus the hit percentage 
which IIRC the ruleqa app can do when the rule detail is shown.
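
As a rough sketch of that age-versus-hit-rate check (not how the ruleqa
app does it), the following groups messages into age buckets and reports
the fraction that hit a rule in each bucket. The input format, a list of
(age in days, hit) pairs, and the bucket edges are assumptions made for
illustration.

```python
from collections import defaultdict

def hit_rate_by_age(records, edges=(1, 7, 30)):
    """Return {bucket label: fraction of messages that hit the rule}.

    records is an iterable of (age_days, hit) pairs; this shape is a
    hypothetical input, not a real mass-check or ruleqa format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for age_days, hit in records:
        # Default to the open-ended bucket for the oldest mail.
        label = ">=%dd" % edges[-1]
        for edge in edges:
            if age_days < edge:
                label = "<%dd" % edge
                break
        totals[label] += 1
        if hit:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

sample = [(0.2, True), (0.5, True), (3, True), (5, False), (40, False)]
print(hit_rate_by_age(sample))
```

A list such as fresh15 would be expected to show its hits concentrated
in the youngest bucket; a flat or inverted curve would suggest the
corpus is too old to say much about it.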

This brings me onto another subject:

One thing that strikes me as missing from the ruleqa app is a way to 
show the overall score distribution for each submitter for ham/spam and 
the overall number of false positives and false negatives in each corpus 
(and a total).  This would be to check two things:

1)  The overall corpus quality (e.g. lots of low or negative scoring
mail in a spam corpus would be a cause for concern, as would ham with
very high scores).

2)  The overall effectiveness of the current rules against the corpora.

The focus should then be on writing rules and plug-ins to deal with the
extremes, i.e. the false positives and false negatives present, and so
reduce the overall percentages of both classes.
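
A sketch of the kind of per-submitter summary described above, assuming
the results are already available as (submitter, is_spam, score) tuples.
That input shape, the 5-point bin width and the function name are
hypothetical; the 5.0 threshold is SpamAssassin's default required
score.

```python
from collections import Counter

def summarize(results, threshold=5.0, bin_width=5):
    """Return (score histogram, FP/FN counts), both keyed by submitter."""
    histogram = Counter()
    errors = Counter()
    for submitter, is_spam, score in results:
        # Floor the score to the start of its bin, e.g. -1.5 -> -5, 6.1 -> 5.
        bin_start = int(score // bin_width) * bin_width
        corpus = "spam" if is_spam else "ham"
        histogram[(submitter, corpus, bin_start)] += 1
        flagged = score >= threshold
        if flagged and not is_spam:
            errors[(submitter, "fp")] += 1   # ham scored as spam
        elif not flagged and is_spam:
            errors[(submitter, "fn")] += 1   # spam scored as ham
    return histogram, errors

sample = [("alice", False, 0.3), ("alice", False, 6.1),
          ("bob", True, 12.0), ("bob", True, -1.5)]
histogram, errors = summarize(sample)
print(dict(errors))
```

Summing the error counts over all submitters gives the overall totals.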

I'm doing this myself by parsing the local mass-check logs and copying 
any false-positives or false-negatives into specific directories so that 
I can sort through them and update rules as necessary.
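
A minimal sketch of that log-parsing step, not the script actually used
here. It assumes each mass-check log line has the form
"VERDICT SCORE PATH TESTS ...", where VERDICT is "Y" for a message
scored as spam and "." otherwise, that header lines start with "#", and
that paths contain no whitespace. The function names and directory
layout are hypothetical.

```python
import shutil
from pathlib import Path

def misclassified(log_lines, corpus_is_spam):
    """Yield (path, score) for messages the rules got wrong."""
    # In a spam corpus a "." line is a false negative;
    # in a ham corpus a "Y" line is a false positive.
    wanted = "." if corpus_is_spam else "Y"
    for line in log_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        verdict, score, path = fields[:3]
        if verdict == wanted:
            yield path, float(score)

def copy_for_review(paths, dest_dir):
    """Copy the listed message files into dest_dir for manual sorting."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for path in paths:
        shutil.copy2(path, dest / Path(path).name)

ham_log = [
    "# mass-check results",
    ". 0 /corpus/ham/1 RULE_A time=1",
    "Y 7 /corpus/ham/2 RULE_B,RULE_C time=2",
]
print(list(misclassified(ham_log, corpus_is_spam=False)))
```

Passing the yielded paths to copy_for_review() with one directory for
false positives and another for false negatives reproduces the layout
described above.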

>>
>> And finally; I can't find any information in the Wiki about adding
>> plugins to a sandbox, specifically if this is allowed and what the
>> loadplugin line should be to get it to load correctly during the
>> mass-checks.
>
> What plugin do you have in mind?  If it is your URI shortener plugin I 
> would have to strongly protest.  Its current design is good in theory 
> but it will never survive in mass production.  It has the capability 
> of making spamassassin far too slow while it is too easy to bypass.

No - I wouldn't dream of putting the full short URI decoder into the
mass-checks in its current form; but I would put a version that simply
reported when short URIs were detected (BTW - the rules that do this in
your sandbox need to backslash escape the dots) as it's more efficient
to do this in a plug-in with an easily expandable list than with a URI
rule and massive regexps IMO.  We should have these sorts of
'instrumentation' rules in the corpus that tell us when a particular
abuse vector is being more heavily exploited and allow them to be used
in meta rules.

Anyway - I was talking in more general terms; I have some ideas for a 
few other plugins.

Regards,
Steve.

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by "Warren Togami Jr." <wt...@gmail.com>.

On 2/4/2011 7:40 AM, Steve Freegard wrote:
> My apologies for the rookie mistake. As a recent addition to the team I
> missed the original discussion on requiring 'tflags nopublish' for all
> sandbox rules.
>
> Can someone point me at the Bug that discussed this and I'll update the
> RuleSandboxes Wiki entry to reflect this.

Currently this is only among our "tribal knowledge".  The Wiki needs a 
bit of reorg to put this stuff into a logical place.

>
> While I'm on the subject of Sandboxes; what are the rules for adding
> DNSBLs to it for testing? Specifically I would like to add the fresh,
> fresh10 and fresh15 lists from spameatingmonkey.com for testing as my
> own local testing found them to be more effective and faster than the
> day-old-bread list that is currently in the core ruleset.

A critical flaw here is that the only way to truly measure its
performance in masscheck is with "reuse", which means the ham/spam
corpora must be tagged with this custom rule during delivery, and
masscheck only records yes/no from the existing spamassassin headers.
It is a waste of resources, and gives misleading results, to test
FRESH-type rules on older mail.

>
> And finally; I can't find any information in the Wiki about adding
> plugins to a sandbox, specifically if this is allowed and what the
> loadplugin line should be to get it to load correctly during the
> mass-checks.

What plugin do you have in mind?  If it is your URI shortener plugin I 
would have to strongly protest.  Its current design is good in theory 
but it will never survive in mass production.  It has the capability of 
making spamassassin far too slow while it is too easy to bypass.

Warren

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by Steve Freegard <st...@stevefreegard.com>.

On 04/02/11 11:26, Justin Mason wrote:
> 2011/2/4 Karsten Bräckelmann<gu...@rudersport.de>:
>> On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
>>> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
>>>> I am not sure who "smf" is, but are they aware that "test rules" without
>>>> "tflags nopublish" could possibly be auto-promoted with a score they
>>>> don't expect?  Scores in the sandbox below are ignored.  We don't want a
>>>> repeat of the FSL_RU_URL issue.
>>>
>>> Getting *cough* late here, and I'm terribly tired...
>>
>> Argh. That's what you get for searching at the wrong end. Right after
>> turning off the machine, and heading to bed...
>>
>> That's Steve Freegard, a recent addition to the committers. :)
>>
>> Anyway, I stand by my recommendation to just add the tflags. This is
>> something we all need to get into the habit of.
>>
>>> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
>>> someone else's sandbox. I guess the recent discussion about rules
>>> possibly spreading unintentionally didn't reach everyone. Although it won't
>>> right now, as we know -- but the corpora being under-limit is another
>>> topic, unrelated to best practices. Adding tflags nopublish is not
>>> intrusive anyway, and warranted in this case, I guess.
>
> +1
>
> btw, iirc it's possible to set "tflags nopublish" on a file-by-file
> basis -- add "#testrules" at the top of the file, and all rules after
> that point are implicitly nopublish.  see
> rulesrc/sandbox/jm/20_bug_5984.cf for an example.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5545 has the background.
>

My apologies for the rookie mistake.  As a recent addition to the team I 
missed the original discussion on requiring 'tflags nopublish' for all 
sandbox rules.

Can someone point me at the Bug that discussed this and I'll update the 
RuleSandboxes Wiki entry to reflect this.

While I'm on the subject of Sandboxes; what are the rules for adding 
DNSBLs to it for testing?  Specifically I would like to add the fresh, 
fresh10 and fresh15 lists from spameatingmonkey.com for testing as my 
own local testing found them to be more effective and faster than the 
day-old-bread list that is currently in the core ruleset.

And finally; I can't find any information in the Wiki about adding 
plugins to a sandbox, specifically if this is allowed and what the 
loadplugin line should be to get it to load correctly during the 
mass-checks.

I'll then update the Wiki on all of these points.

Kind regards,
Steve.

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by Justin Mason <jm...@jmason.org>.

2011/2/4 Karsten Bräckelmann <gu...@rudersport.de>:
> On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
>> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
>> > I am not sure who "smf" is, but are they aware that "test rules" without
>> > "tflags nopublish" could possibly be auto-promoted with a score they
>> > don't expect?  Scores in the sandbox below are ignored.  We don't want a
>> > repeat of the FSL_RU_URL issue.
>>
>> Getting *cough* late here, and I'm terribly tired...
>
> Argh. That's what you get for searching at the wrong end. Right after
> turning off the machine, and heading to bed...
>
> That's Steve Freegard, a recent addition to the committers. :)
>
> Anyway, I stand by my recommendation to just add the tflags. This is
> something we all need to get into the habit of.
>
>> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
>> someone else's sandbox. I guess the recent discussion about rules
>> possibly spreading unintentionally didn't reach everyone. Although it won't
>> right now, as we know -- but the corpora being under-limit is another
>> topic, unrelated to best practices. Adding tflags nopublish is not
>> intrusive anyway, and warranted in this case, I guess.

+1

btw, iirc it's possible to set "tflags nopublish" on a file-by-file
basis -- add "#testrules" at the top of the file, and all rules after
that point are implicitly nopublish.  see
rulesrc/sandbox/jm/20_bug_5984.cf for an example.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5545 has the background.
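
For checking a sandbox file before committing, here is a rough lint
sketch that lists rules which are not marked nopublish. It is a
hypothetical helper, not part of the SpamAssassin build tools: it only
recognizes a handful of rule keywords, and it takes the "#testrules"
behaviour on trust from the description above rather than from the
build scripts.

```python
# Rule-defining keywords this sketch recognizes (deliberately incomplete).
RULE_TYPES = {"header", "body", "rawbody", "uri", "full", "meta", "mimeheader"}

def publishable_rules(cf_text):
    """Return rule names that lack 'tflags NAME nopublish' coverage."""
    defined = []
    nopublish = set()
    in_testrules = False
    for raw in cf_text.splitlines():
        line = raw.strip()
        if line.lower().startswith("#testrules"):
            # Assumption: every rule defined after this marker is nopublish.
            in_testrules = True
            continue
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        keyword, name = parts[0], parts[1]
        if keyword in RULE_TYPES:
            if name not in defined:
                defined.append(name)
            if in_testrules:
                nopublish.add(name)
        elif keyword == "tflags" and "nopublish" in parts[2:]:
            nopublish.add(name)
    return [name for name in defined if name not in nopublish]

example = "\n".join([
    "uri    SMF_A /example\\.com/",
    "tflags SMF_A nopublish",
    "body   SMF_B /foo/",
    "#testrules",
    "body   SMF_C /bar/",
])
print(publishable_rules(example))
```

An empty list means every rule in the file is covered either by an
explicit tflags line or by the marker.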

--j.

Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
> > I am not sure who "smf" is, but are they aware that "test rules" without 
> > "tflags nopublish" could possibly be auto-promoted with a score they 
> > don't expect?  Scores in the sandbox below are ignored.  We don't want a
> > repeat of the FSL_RU_URL issue.
> 
> Getting *cough* late here, and I'm terribly tired...

Argh. That's what you get for searching at the wrong end. Right after
turning off the machine, and heading to bed...

That's Steve Freegard, a recent addition to the committers. :)

Anyway, I stand by my recommendation to just add the tflags. This is
something we all need to get into the habit of.


> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
> someone else's sandbox. I guess the recent discussion about rules
> possibly spreading unintentionally didn't reach everyone. Although it won't
> right now, as we know -- but the corpora being under-limit is another
> topic, unrelated to best practices. Adding tflags nopublish is not
> intrusive anyway, and warranted in this case, I guess.
> 
> Definitely, good catch, Warren.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}