Posted to dev@spamassassin.apache.org by Warren Togami <wt...@redhat.com> on 2009/09/28 07:44:28 UTC

Commit access to svn

Hi folks,

What do I need to do to gain commit access?  I sent in the signed Apache 
CLA a few weeks ago but I heard nothing back.

My plans initially are only to put new tests into the sandbox to see how 
they do.

* Get Adam Katz's KHOP rules updated in the sandbox so they can be 
properly tested.

* Sandbox testing of additional blacklists like JMF, SEM

* Split PSBL into sub rules.  RCVD_IN_PSBL is currently looking at all 
headers instead of just last-external.  This can work very well.  But I 
believe there is a simple way to improve this further by splitting it 
into two subrules.  This change can be made after the GA rescoring if 
the rule is split properly.

Use RCVD_IN_PSBL_2WEEKS to assign a score.  RCVD_IN_PSBL_DEEP would be 
the equivalent to RCVD_IN_PSBL_2WEEKS.  The stricter RCVD_IN_PSBL would 
be a subrule that matches only with last-external, thereby being 
stricter and eliminating most of the already minuscule chance of false 
positives.  Thus the full score of RCVD_IN_PSBL_2WEEKS would be split 
into two parts.

Before
RCVD_IN_PSBL_2WEEKS score 2
This rule does deep parsing which is often good, but sometimes bad.

After
RCVD_IN_PSBL score 2
This rule matches only last-external, making it safer from FP's.
RCVD_IN_PSBL_DEEP score -1
This rule can be scored separately, subtracting a tiny amount if the 
PSBL hit was found in deep parsing.  Both rules would trigger, one adds, 
the second subtracts.  The subtracting rule would never fire on its own.

* I am also looking at ways to expand the use of the SOUGHT methodology. 
  Either improve the existing SOUGHT, or launch a separate SOUGHT-like 
channel based upon an entirely different corpus.  For example, Japanese 
spam trap corpus + Japanese ham corpus = SOUGHT-JP nightly sa-update 
channel.  I'm even seeing big spam differences between jm's corpus 
generated sought rules and my own corpus.  There is room for improvement 
with the current SOUGHT.

Warren Togami
wtogami@redhat.com

Re: Commit access to svn

Posted by Justin Mason <jm...@jmason.org>.

On Mon, Sep 28, 2009 at 06:44, Warren Togami <wt...@redhat.com> wrote:

> Hi folks,
>
> What do I need to do to gain commit access?  I sent in the signed Apache
> CLA a few weeks ago but I heard nothing back.
>

that's normal; we don't automatically create an account on CLA receipt, and
we generally need the CLA much earlier than that (if the contributions have
already been significant enough).

I'll propose it to the pmc list and I'm pretty sure we'll be voting you in.
as per http://wiki.apache.org/spamassassin/ProjectRoles , you haven't
contributed enough bad code to do otherwise ;)

--j.


> My plans initially are only to put new tests into the sandbox to see how
> they do.
>
> * Get Adam Katz's KHOP rules updated in the sandbox so they can be properly
> tested.
>
> * Sandbox testing of additional blacklists like JMF, SEM
>
> * Split PSBL into sub rules.  RCVD_IN_PSBL is currently looking at all
> headers instead of just last-external.  This can work very well.  But I
> believe there is a simple way to improve this further by splitting it into
> two subrules.  This change can be made after the GA rescoring if the rule is
> split properly.
>
> Use RCVD_IN_PSBL_2WEEKS to assign a score.  RCVD_IN_PSBL_DEEP would be the
> equivalent to RCVD_IN_PSBL_2WEEKS.  The stricter RCVD_IN_PSBL would be a
> subrule that matches only with last-external, thereby being stricter and
> eliminating most of the already minuscule chance of false positives.  Thus
> the full score of RCVD_IN_PSBL_2WEEKS would be split into two parts.
>
> Before
> RCVD_IN_PSBL_2WEEKS score 2
> This rule does deep parsing which is often good, but sometimes bad.
>
> After
> RCVD_IN_PSBL score 2
> This rule matches only last-external, making it safer from FP's.
> RCVD_IN_PSBL_DEEP score -1
> This rule can be scored separately, subtracting a tiny amount if the
> PSBL hit was found in deep parsing.  Both rules would trigger, one adds, the
> second subtracts.  The subtracting rule would never fire on its own.
>
> * I am also looking at ways to expand the use of the SOUGHT methodology.
>  Either improve the existing SOUGHT, or launch a separate SOUGHT-like
> channel based upon an entirely different corpus.  For example, Japanese spam
> trap corpus + Japanese ham corpus = SOUGHT-JP nightly sa-update channel.
>  I'm even seeing big spam differences between jm's corpus generated sought
> rules and my own corpus.  There is room for improvement with the current
> SOUGHT.
>
> Warren Togami
> wtogami@redhat.com
>
>


-- 
--j.

Re: Commit access to svn

Posted by Warren Togami <wt...@redhat.com>.

On 09/29/2009 08:49 AM, Justin Mason wrote:
>
>     I assume this means we are capable of both deep parsing and
>     lastexternal with a single lookup?
>
>
> Good question.  I think so but am not certain.
>
> --
> --j.

If we can't do both in one query, then we shouldn't bother with a 
second deep parsing rule.

Warren

Re: Commit access to svn

Posted by Justin Mason <jm...@jmason.org>.

> I assume this means we are capable of both deep parsing and lastexternal
> with a single lookup?
>

Good question.  I think so but am not certain.

-- 
--j.

Re: Commit access to svn

Posted by Warren Togami <wt...@redhat.com>.

On 09/28/2009 04:37 PM, Justin Mason wrote:
>
>
> On Mon, Sep 28, 2009 at 21:35, Warren Togami <wtogami@redhat.com> wrote:
>
>     On 09/28/2009 04:32 PM, Justin Mason wrote:
>
>         I agree we should have used lastexternal.  we can do the 'subtract'
>         trick but I'd prefer to do it by simply splitting the rules into a
>         RCVD_IN_PSBL_LASTEXTERNAL (score 2) and RCVD_IN_PSBL_DEEP (score 1),
>         possibly using metas, so that users don't see a confusingly negative
>         score hitting on spam -- principle of least surprise and all that.
>
>
>     Could the lastexternal version be called simply RCVD_IN_PSBL?  That
>     seems to be expected of DNSBL's and shorter name is better I guess.
>
>
> sure, that works for me.
>
> --
> --j.

I assume this means we are capable of both deep parsing and lastexternal 
with a single lookup?

Warren

Re: Commit access to svn

Posted by Justin Mason <jm...@jmason.org>.

On Mon, Sep 28, 2009 at 21:35, Warren Togami <wt...@redhat.com> wrote:

> On 09/28/2009 04:32 PM, Justin Mason wrote:
>
>> I agree we should have used lastexternal.  we can do the 'subtract'
>> trick but I'd prefer to do it by simply splitting the rules into a
>> RCVD_IN_PSBL_LASTEXTERNAL (score 2) and RCVD_IN_PSBL_DEEP (score 1),
>> possibly using metas, so that users don't see a confusingly negative
>> score hitting on spam -- principle of least surprise and all that.
>>
>
> Could the lastexternal version be called simply RCVD_IN_PSBL?  That seems
> to be expected of DNSBL's and shorter name is better I guess.
>

sure, that works for me.

-- 
--j.

Re: Commit access to svn

Posted by Warren Togami <wt...@redhat.com>.

On 09/28/2009 04:32 PM, Justin Mason wrote:
> I agree we should have used lastexternal.  we can do the 'subtract'
> trick but I'd prefer to do it by simply splitting the rules into a
> RCVD_IN_PSBL_LASTEXTERNAL (score 2) and RCVD_IN_PSBL_DEEP (score 1),
> possibly using metas, so that users don't see a confusingly negative
> score hitting on spam -- principle of least surprise and all that.

Could the lastexternal version be called simply RCVD_IN_PSBL?  That 
seems to be expected of DNSBLs, and a shorter name is better I guess.

Warren

Re: Commit access to svn

Posted by Justin Mason <jm...@jmason.org>.

On Mon, Sep 28, 2009 at 15:33, Warren Togami <wt...@redhat.com> wrote:

> On 09/28/2009 01:44 AM, Warren Togami wrote:
>
>>
>> Use RCVD_IN_PSBL_2WEEKS to assign a score. RCVD_IN_PSBL_DEEP would be the
>> equivalent to RCVD_IN_PSBL_2WEEKS. The stricter RCVD_IN_PSBL would be a
>> subrule that matches only with last-external, thereby being stricter and
>> eliminating most of the already minuscule chance of false positives.
>> Thus the full score of RCVD_IN_PSBL_2WEEKS would be split into two parts.
>>
>> Before
>> RCVD_IN_PSBL_2WEEKS score 2
>> This rule does deep parsing which is often good, but sometimes bad.
>>
>> After
>> RCVD_IN_PSBL score 2
>> This rule matches only last-external, making it safer from FP's.
>> RCVD_IN_PSBL_DEEP score -1
>> This rule can be scored separately, subtracting a tiny amount if the
>> PSBL hit was found in deep parsing. Both rules would trigger, one adds,
>> the second subtracts. The subtracting rule would never fire on its own.
>>
>
> OK, the above "subtract" probably needs some explanation.
>
> This came from a feeling of discomfort with deep parsing of PSBL.  PSBL
> *is* working well in masscheck with deep parsing with very few FP's. The
> trouble is with FP's such as a user sending e-mail from a wireless broadband
> card via a legitimate mail server, where the card's IP address is
> legitimately blacklisted.  Even though this
> alone is not likely to cause their mail to be classified as spam with the
> default threshold of 5, there is nothing the user can do about the previous
> user of that IP having sent spam.
>
> For this reason I think we should have used psbl-lastexternal.
> psbl-lastexternal is extra certain to be correct and deserves a high score.
> [1]  Deep parsing, however, has been shown to be mostly correct and probably
> deserves a smaller score in cases where psbl-lastexternal didn't hit.  Can
> spamassassin do separate sub-rule matches of lastexternal and deep parsing
> without querying twice?
>
> [1] We are still hitting some Yahoo FP's because filtering out Yahoo from
> the blacklist was broken until a few days ago.  These should disappear
> entirely by the two week timeout.


I agree we should have used lastexternal.  we can do the 'subtract' trick
but I'd prefer to do it by simply splitting the rules into a
RCVD_IN_PSBL_LASTEXTERNAL (score 2) and RCVD_IN_PSBL_DEEP (score 1),
possibly using metas, so that users don't see a confusingly negative score
hitting on spam -- principle of least surprise and all that.
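
For illustration only, the two scoring schemes discussed above can be
compared with a small Python sketch.  The function names are hypothetical,
and two readings are assumptions rather than anything confirmed in this
thread: that in the 'subtract' scheme the base rule fires on any PSBL hit
and the negative rule fires only when the hit was not on last-external, and
that in the split scheme RCVD_IN_PSBL_DEEP fires only when the last-external
check misses (for example via a meta).  The scores 2, 1 and -1 are the ones
proposed above.

```python
def subtract_scheme(last_external_hit, deep_hit):
    """'Subtract' trick (assumed reading): base rule adds 2 on any hit,
    a second rule subtracts 1 when the hit came only from deep parsing."""
    rules = {}
    if last_external_hit or deep_hit:
        rules["RCVD_IN_PSBL"] = 2.0
        if not last_external_hit:
            rules["RCVD_IN_PSBL_DEEP"] = -1.0
    return rules, sum(rules.values())


def split_scheme(last_external_hit, deep_hit):
    """Split rules (assumed mutually exclusive): 2 for a last-external hit,
    otherwise 1 for a deep-only hit, so no rule ever shows a negative score."""
    rules = {}
    if last_external_hit:
        rules["RCVD_IN_PSBL_LASTEXTERNAL"] = 2.0
    elif deep_hit:
        rules["RCVD_IN_PSBL_DEEP"] = 1.0
    return rules, sum(rules.values())


# Compare the rules that would be reported in each case.
for flags in [(True, True), (False, True), (False, False)]:
    print(flags, subtract_scheme(*flags), split_scheme(*flags))
```

Under these assumptions both schemes give the same totals (2 for a
last-external hit, 1 for a deep-only hit); the difference is only that the
split scheme never reports a negative-scoring rule on a spam message.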

-- 
--j.

Re: Commit access to svn

Posted by Warren Togami <wt...@redhat.com>.

On 09/28/2009 01:44 AM, Warren Togami wrote:
>
> Use RCVD_IN_PSBL_2WEEKS to assign a score. RCVD_IN_PSBL_DEEP would be the
> equivalent to RCVD_IN_PSBL_2WEEKS. The stricter RCVD_IN_PSBL would be a
> subrule that matches only with last-external, thereby being stricter and
> eliminating most of the already minuscule chance of false positives.
> Thus the full score of RCVD_IN_PSBL_2WEEKS would be split into two parts.
>
> Before
> RCVD_IN_PSBL_2WEEKS score 2
> This rule does deep parsing which is often good, but sometimes bad.
>
> After
> RCVD_IN_PSBL score 2
> This rule matches only last-external, making it safer from FP's.
> RCVD_IN_PSBL_DEEP score -1
> This rule can be scored separately, subtracting a tiny amount if the
> PSBL hit was found in deep parsing. Both rules would trigger, one adds,
> the second subtracts. The subtracting rule would never fire on its own.

OK, the above "subtract" probably needs some explanation.

This came from a feeling of discomfort with deep parsing of PSBL.  PSBL 
*is* working well in masscheck with deep parsing with very few FP's. 
The trouble is with FP's such as a user sending e-mail from a wireless 
broadband card via a legitimate mail server, where the card's IP address 
is legitimately blacklisted.  Even 
though this alone is not likely to cause their mail to be classified as 
spam with the default threshold of 5, there is nothing the user can do 
about the previous user of that IP having sent spam.

For this reason I think we should have used psbl-lastexternal. 
psbl-lastexternal is extra certain to be correct and deserves a high 
score. [1]  Deep parsing, however, has been shown to be mostly correct and 
probably deserves a smaller score in cases where psbl-lastexternal 
didn't hit.  Can spamassassin do separate sub-rule matches of 
lastexternal and deep parsing without querying twice?

[1] We are still hitting some Yahoo FP's because filtering out Yahoo 
from the blacklist was broken until a few days ago.  These should 
disappear entirely by the two week timeout.
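
The single-lookup question above can be sketched in Python as follows.  This
is not SpamAssassin's implementation, and the thread does not confirm how
SpamAssassin behaves; classify_psbl, the is_listed callback and the relay
ordering (last-external relay first) are all assumptions made for the
example, and the IP addresses are documentation-range placeholders.

```python
from typing import Callable, Dict, List, Tuple


def classify_psbl(relay_ips: List[str],
                  is_listed: Callable[[str], bool]) -> Tuple[bool, bool]:
    """Return (last_external_hit, deep_only_hit).

    relay_ips is assumed to be ordered with the last-external relay first.
    Each distinct IP is passed to is_listed at most once; both results are
    derived from that one set of answers.
    """
    cache: Dict[str, bool] = {}

    def lookup(ip: str) -> bool:
        # Query the blacklist only the first time an IP is seen.
        if ip not in cache:
            cache[ip] = is_listed(ip)
        return cache[ip]

    if not relay_ips:
        return False, False
    hits = [lookup(ip) for ip in relay_ips]
    last_external_hit = hits[0]
    deep_only_hit = any(hits) and not last_external_hit
    return last_external_hit, deep_only_hit


# Hypothetical blacklist contents and a stub lookup that records queries.
LISTED = {"203.0.113.7"}
queries: List[str] = []


def fake_lookup(ip: str) -> bool:
    queries.append(ip)
    return ip in LISTED


print(classify_psbl(["192.0.2.10", "203.0.113.7"], fake_lookup))  # (False, True)
print(queries)  # each relay IP was queried once
```

If something like this holds in practice, the last-external rule and the
deep rule could share one set of DNS answers instead of querying twice.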

Warren Togami
wtogami@redhat.com