Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/02/04 04:15:44 UTC
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
> I am not sure who "smf" is, but are they aware that "test rules" without
> "tflags nopublish" could possibly be auto-promoted with a score they
> don't expect. Scores in the sandbox below are ignored. We don't want a
> repeat of the FSL_RU_URL issue.
Getting *cough* late here, and I'm terribly tired...
Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
someone else's sandbox. I guess the recent discussion about rules
possibly spreading unintended didn't reach everyone. Although it won't
right now, as we know -- but the corpora being under-limit is another
topic, unrelated to best practices. Adding tflags nopublish is not
intrusive anyway, and warranted in this case, I guess.
Definitely, good catch, Warren.
--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by "Warren Togami Jr." <wt...@gmail.com>.
On 2/9/2011 2:25 PM, Steve Freegard wrote:
>> What plugin do you have in mind? If it is your URI shortener plugin I
>> would have to strongly protest. Its current design is good in theory
>> but it will never survive in mass production. It has the capability of
>> making spamassassin far too slow while it is too easy to bypass.
>
> No - I wouldn't dream of putting the full short URI decoder into the
> mass-checks in its current form; but I would put a version that simply
> reported when short URIs were detected (BTW - the rules that do this in
> your sandbox need to backslash escape the dots) as it's more efficient
> to do this in a plug-in with easily expandable list than with a URI rule
> and massive regexps IMO. We should have these sorts of 'instrumentation'
> rules in the corpus that tell us when a particular abuse vector is being
> more heavily exploited and allows them to be used in meta rules.
>
> Anyway - I was talking in more general terms; I have some ideas for a
> few other plugins.
>
> Regards,
> Steve.
Go ahead and delete my crappy huge shortener rule after you have
something better to replace it.
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/wtogami/README
Here is my open policy. I invite other sandboxers to have their own.
Warren
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by Henrik Krohns <he...@hege.li>.
On Thu, Feb 10, 2011 at 12:25:17AM +0000, Steve Freegard wrote:
>
> as it's more efficient to do this in a plug-in with easily expandable list
> than with a URI rule and massive regexps IMO.
Nah. Given the tiny size of the list, it's far more efficient to use a
single large regexp than to go through a plugin that iterates over the
URIs internally and still has to match them somehow (regexp or hashes).
Of course you can use a simple script to generate the rule from a list using
Regexp::Assemble.
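As a rough illustration of generating the rule from a list, here is a minimal sketch in Python. It does not use Regexp::Assemble; it simply joins the escaped domains into one alternation, which should be adequate for a short list. The domain list and the rule name T_EXAMPLE_SHORT_URI are hypothetical and not taken from any sandbox. It also backslash-escapes the dots, as mentioned elsewhere in this thread, and marks the rule nopublish.

```python
import re

# Hypothetical sample list; the real list of shortener domains is not
# given in this thread.
SHORTENER_DOMAINS = ["bit.ly", "tinyurl.com", "goo.gl", "is.gd"]


def build_uri_rule(rule_name, domains):
    """Return SpamAssassin rule lines matching any of the given domains."""
    # re.escape backslash-escapes the dots, so "bit.ly" cannot match "bitxly".
    # Sorting (longest first, then alphabetical) keeps the output stable.
    alternatives = sorted(
        (re.escape(d.lower()) for d in set(domains)),
        key=lambda s: (-len(s), s),
    )
    # Slashes are escaped because the rule uses /.../ as regexp delimiters.
    pattern = r"^https?:\/\/(?:www\.)?(?:%s)\/" % "|".join(alternatives)
    return "\n".join([
        "uri      %s  /%s/i" % (rule_name, pattern),
        "describe %s  URI uses a known URL shortener" % rule_name,
        "tflags   %s  nopublish" % rule_name,
    ])


if __name__ == "__main__":
    print(build_uri_rule("T_EXAMPLE_SHORT_URI", SHORTENER_DOMAINS))
```

Regexp::Assemble would additionally merge common prefixes into a more compact pattern; the plain alternation above is only the simplest variant of the same idea.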
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by Steve Freegard <st...@stevefreegard.com>.
On 09/02/11 23:31, Warren Togami Jr. wrote:
> On 2/4/2011 7:40 AM, Steve Freegard wrote:
>> My apologies for the rookie mistake. As a recent addition to the team I
>> missed the original discussion on requiring 'tflags nopublish' for all
>> sandbox rules.
>>
>> Can someone point me at the Bug that discussed this and I'll update the
>> RuleSandboxes Wiki entry to reflect this.
>
> Currently this is only among our "tribal knowledge". The Wiki needs a
> bit of reorg to put this stuff into a logical place.
>
Ok - I'll update the RuleSandboxes entry. Re-organizing the wiki is a
totally separate issue.
>>
>> While I'm on the subject of Sandboxes; what are the rules for adding
>> DNSBLs to it for testing? Specifically I would like to add the fresh,
>> fresh10 and fresh15 lists from spameatingmonkey.com for testing as my
>> own local testing found them to be more effective and faster than the
>> day-old-bread list that is currently in the core ruleset.
>
> A critical flaw here is that the only way to truly measure its
> performance in masscheck is with "reuse", which means the ham/spam
> corpora must be tagged with this custom rule during delivery, and
> masscheck only records yes/no from the existing spamassassin headers.
> It is a waste of resources, and gives misleading results, to test
> FRESH-type rules on older mail.
>
Unfortunately I can see many issues with requiring that all corpora must
contain SpamAssassin mark-up at the time it was received; this simply
isn't practical for some people (myself included as my trap data comes
from lots of different sources).
Would this not also be dangerous if the SA headers in the message were
generated by an older version of SA, or by one that had been modified or
was running with local-tests-only, or had been intentionally or
unintentionally poisoned to skew the results?
Blacklist effectiveness for things like day-old-bread and fresh5/10/15
can be checked by looking at the message age versus the hit percentage
which IIRC the ruleqa app can do when the rule detail is shown.
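To make the age-versus-hit-rate comparison concrete, here is a minimal Python sketch. It is not how the ruleqa app does it; it only assumes you already have, for each message, its age in days and whether the blacklist rule hit. The sample records are made up for illustration.

```python
from collections import defaultdict


def hit_rate_by_age(records, bucket_days=7):
    """Group (age_in_days, rule_hit) pairs into age buckets.

    Returns {bucket_start_day: percentage_of_messages_hit}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for age_days, rule_hit in records:
        bucket = (int(age_days) // bucket_days) * bucket_days
        totals[bucket] += 1
        if rule_hit:
            hits[bucket] += 1
    return {b: 100.0 * hits[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    # Made-up sample data: (message age in days, did the rule hit?)
    sample = [(0, True), (1, True), (3, False), (9, False), (12, True), (30, False)]
    for start, pct in hit_rate_by_age(sample).items():
        print("age %3d-%3d days: %5.1f%% hit" % (start, start + 6, pct))
```

A list aimed at very fresh domains would be expected to show its hits concentrated in the youngest bucket, which is the pattern to look for.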
This brings me onto another subject:
One thing that strikes me as missing from the ruleqa app is a way to
show the overall score distribution for each submitter for ham/spam and
the overall number of false positives and false negatives in each corpus
(and a total). This would be to check two things:
1) The overall corpus quality (e.g. lots of low or negative scoring
mail in a spam corpus would be cause for concern, as would very
high-scoring mail in a ham corpus).
2) The overall effectiveness of the current rules against the corpora.
The focus should then be on writing rules and plug-ins to deal with the
extremes e.g. the false-positives or false-negatives present and reduce
the overall percentages of both classes.
I'm doing this myself by parsing the local mass-check logs and copying
any false-positives or false-negatives into specific directories so that
I can sort through them and update rules as necessary.
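A minimal Python sketch of that kind of log parsing follows. It is not the actual script. It assumes each mass-check log line is whitespace-separated, with the score in the second field and the message path in the third, that lines starting with "#" are comments, and that ham and spam are logged to separate files. The 5.0 threshold and the sample lines are also assumptions.

```python
def misclassified(log_lines, corpus_is_spam, threshold=5.0):
    """Return (path, score) for messages scored on the wrong side.

    For a ham log these are false positives; for a spam log, false
    negatives. Paths containing whitespace are not handled.
    """
    results = []
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a result line in the assumed format
        try:
            score = float(fields[1])
        except ValueError:
            continue
        flagged_as_spam = score >= threshold
        if flagged_as_spam != corpus_is_spam:
            results.append((fields[2], score))
    return results


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    ham_log = [
        "# example ham log",
        ". -1 /corpus/ham/0001 RULE_A time=1296800000",
        "Y 8 /corpus/ham/0002 RULE_B,RULE_C time=1296800100",
    ]
    for path, score in misclassified(ham_log, corpus_is_spam=False):
        print("false positive: %s (score %.1f)" % (path, score))
```

The returned paths could then be copied into per-class directories (for example with shutil.copy) for manual review.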
>>
>> And finally; I can't find any information in the Wiki about adding
>> plugins to a sandbox, specifically if this is allowed and what the
>> loadplugin line should be to get it to load correctly during the
>> mass-checks.
>
> What plugin do you have in mind? If it is your URI shortener plugin I
> would have to strongly protest. Its current design is good in theory
> but it will never survive in mass production. It has the capability
> of making spamassassin far too slow while it is too easy to bypass.
No - I wouldn't dream of putting the full short URI decoder into the
mass-checks in its current form; but I would put a version that simply
reported when short URIs were detected (BTW - the rules that do this in
your sandbox need to backslash escape the dots) as it's more efficient
to do this in a plug-in with easily expandable list than with a URI rule
and massive regexps IMO. We should have these sorts of
'instrumentation' rules in the corpus that tell us when a particular
abuse vector is being more heavily exploited and allows them to be used
in meta rules.
Anyway - I was talking in more general terms; I have some ideas for a
few other plugins.
Regards,
Steve.
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by "Warren Togami Jr." <wt...@gmail.com>.
On 2/4/2011 7:40 AM, Steve Freegard wrote:
> My apologies for the rookie mistake. As a recent addition to the team I
> missed the original discussion on requiring 'tflags nopublish' for all
> sandbox rules.
>
> Can someone point me at the Bug that discussed this and I'll update the
> RuleSandboxes Wiki entry to reflect this.
Currently this is only among our "tribal knowledge". The Wiki needs a
bit of reorg to put this stuff into a logical place.
>
> While I'm on the subject of Sandboxes; what are the rules for adding
> DNSBLs to it for testing? Specifically I would like to add the fresh,
> fresh10 and fresh15 lists from spameatingmonkey.com for testing as my
> own local testing found them to be more effective and faster than the
> day-old-bread list that is currently in the core ruleset.
A critical flaw here is that the only way to truly measure its
performance in masscheck is with "reuse", which means the ham/spam
corpora must be tagged with this custom rule during delivery, and
masscheck only records yes/no from the existing spamassassin headers.
It is a waste of resources, and gives misleading results, to test
FRESH-type rules on older mail.
>
> And finally; I can't find any information in the Wiki about adding
> plugins to a sandbox, specifically if this is allowed and what the
> loadplugin line should be to get it to load correctly during the
> mass-checks.
What plugin do you have in mind? If it is your URI shortener plugin I
would have to strongly protest. Its current design is good in theory
but it will never survive in mass production. It has the capability of
making spamassassin far too slow while it is too easy to bypass.
Warren
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by Steve Freegard <st...@stevefreegard.com>.
On 04/02/11 11:26, Justin Mason wrote:
> 2011/2/4 Karsten Bräckelmann<gu...@rudersport.de>:
>> On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
>>> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
>>>> I am not sure who "smf" is, but are they aware that "test rules" without
>>>> "tflags nopublish" could possibly be auto-promoted with a score they
>>>> don't expect. Scores in the sandbox below are ignored. We don't want a
>>>> repeat of the FSL_RU_URL issue.
>>>
>>> Getting *cough* late here, and I'm terribly tired...
>>
>> Argh. That's what you get for searching at the wrong end. Right after
>> turning off the machine, and heading to bed...
>>
>> That's Steve Freegard, a recent addition to the committers. :)
>>
>> Anyway, I stand by my recommendation to just add the tflags. This is
>> something we all need to get into the habit of.
>>
>>> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
>>> someone else's sandbox. I guess the recent discussion about rules
>>> possibly spreading unintended didn't reach everyone. Although it won't
>>> right now, as we know -- but the corpora being under-limit is another
>>> topic, unrelated to best practices. Adding tflags nopublish is not
>>> intrusive anyway, and warranted in this case, I guess.
>
> +1
>
> btw, iirc it's possible to set "tflags nopublish" on a file-by-file
> basis -- add "#testrules" at the top of the file, and all rules after
> that point are implicitly nopublish. see
> rulesrc/sandbox/jm/20_bug_5984.cf for an example.
>
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5545 has the background.
>
My apologies for the rookie mistake. As a recent addition to the team I
missed the original discussion on requiring 'tflags nopublish' for all
sandbox rules.
Can someone point me at the Bug that discussed this and I'll update the
RuleSandboxes Wiki entry to reflect this.
While I'm on the subject of Sandboxes; what are the rules for adding
DNSBLs to it for testing? Specifically I would like to add the fresh,
fresh10 and fresh15 lists from spameatingmonkey.com for testing as my
own local testing found them to be more effective and faster than the
day-old-bread list that is currently in the core ruleset.
And finally; I can't find any information in the Wiki about adding
plugins to a sandbox, specifically if this is allowed and what the
loadplugin line should be to get it to load correctly during the
mass-checks.
I'll then update the Wiki on all of these points.
Kind regards,
Steve.
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by Justin Mason <jm...@jmason.org>.
2011/2/4 Karsten Bräckelmann <gu...@rudersport.de>:
> On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
>> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
>> > I am not sure who "smf" is, but are they aware that "test rules" without
>> > "tflags nopublish" could possibly be auto-promoted with a score they
>> > don't expect. Scores in the sandbox below are ignored. We don't want a
>> > repeat of the FSL_RU_URL issue.
>>
>> Getting *cough* late here, and I'm terribly tired...
>
> Argh. That's what you get for searching at the wrong end. Right after
> turning off the machine, and heading to bed...
>
> That's Steve Freegard, a recent addition to the committers. :)
>
> Anyway, I stand by my recommendation to just add the tflags. This is
> something we all need to get into the habit of.
>
>> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
>> someone else's sandbox. I guess the recent discussion about rules
>> possibly spreading unintended didn't reach everyone. Although it won't
>> right now, as we know -- but the corpora being under-limit is another
>> topic, unrelated to best practices. Adding tflags nopublish is not
>> intrusive anyway, and warranted in this case, I guess.
+1
btw, iirc it's possible to set "tflags nopublish" on a file-by-file
basis -- add "#testrules" at the top of the file, and all rules after
that point are implicitly nopublish. see
rulesrc/sandbox/jm/20_bug_5984.cf for an example.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5545 has the background.
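For anyone wanting to check their own sandbox file, here is a minimal Python sketch of a lint pass. It lists rules that have neither an explicit "tflags ... nopublish" nor a preceding "#testrules" line. It relies on the behaviour described above from memory, not on the actual build scripts, and the list of rule types it recognises is an assumption and probably incomplete.

```python
import re

# Assumed set of rule-definition keywords; extend as needed.
RULE_TYPES = ("header", "body", "rawbody", "uri", "full", "meta", "mimeheader")
RULE_DEF = re.compile(r"^\s*(?:%s)\s+(\w+)\s" % "|".join(RULE_TYPES))
TFLAGS = re.compile(r"^\s*tflags\s+(\w+)\s+(.*)$")


def rules_missing_nopublish(cf_text):
    """Return rule names that are not covered by nopublish."""
    defined = []            # rule names, in definition order
    protected = set()       # rules covered explicitly or via #testrules
    testrules_seen = False
    for line in cf_text.splitlines():
        if line.strip().lower() == "#testrules":
            testrules_seen = True
            continue
        m = RULE_DEF.match(line)
        if m:
            name = m.group(1)
            if name not in defined:
                defined.append(name)
            if testrules_seen:
                protected.add(name)
            continue
        m = TFLAGS.match(line)
        # Ignore anything after a trailing comment on the tflags line.
        if m and "nopublish" in m.group(2).split("#")[0].split():
            protected.add(m.group(1))
    return [name for name in defined if name not in protected]


if __name__ == "__main__":
    example = "\n".join([
        "uri   T_ONE  /example\\.com/",
        "tflags T_ONE nopublish",
        "body  T_TWO  /some phrase/",
        "#testrules",
        "body  T_THREE /another phrase/",
    ])
    print(rules_missing_nopublish(example))
```

In the example, only T_TWO is reported: T_ONE has an explicit tflags line and T_THREE follows the "#testrules" marker.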
--j.
Re: svn commit: r1067022 - /spamassassin/trunk/rulesrc/sandbox/smf/20_smf.cf
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2011-02-04 at 04:15 +0100, Karsten Bräckelmann wrote:
> On Thu, 2011-02-03 at 16:35 -1000, Warren Togami Jr. wrote:
> > I am not sure who "smf" is, but are they aware that "test rules" without
> > "tflags nopublish" could possibly be auto-promoted with a score they
> > don't expect. Scores in the sandbox below are ignored. We don't want a
> > repeat of the FSL_RU_URL issue.
>
> Getting *cough* late here, and I'm terribly tired...
Argh. That's what you get for searching at the wrong end. Right after
turning off the machine, and heading to bed...
That's Steve Freegard, a recent addition to the committers. :)
Anyway, I stand by my recommendation to just add the tflags. This is
something we all need to get into the habit of.
> Warren, IMHO, please feel free to add the tflags necessary. Yes, inside
> someone else's sandbox. I guess the recent discussion about rules
> possibly spreading unintended didn't reach everyone. Although it won't
> right now, as we know -- but the corpora being under-limit is another
> topic, unrelated to best practices. Adding tflags nopublish is not
> intrusive anyway, and warranted in this case, I guess.
>
> Definitely, good catch, Warren.
--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}