Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/02/02 00:02:42 UTC

Mass-check Corpora (once was: Re: Update Mirror Issues)

> > > > SPAM: 51330 (150000 required)
> > >
> > > Joao Gouveia will soon be requesting an account to join the nightly
> > > masscheck. He has a significant quantity of spam, and hopefully much
> > > of it is European language so it should add to our diversity.
> >
> > I wonder how scoring will be affected if his corpus is >50k messages?
> > :)
> 
> Yikes.  He has over 1 million spam messages per day.  He's figuring out
> a way to filter it to eliminate duplicates and take a random sample of
> ~20k * 7 days.  But still, that's going to skew us too much.

Yikes indeed.

Maybe Joao should answer these himself...

Given the numbers, is that purely trap driven? Is there a legion of human
users manually verifying the spam?

What exactly does "filter duplicates" mean? If that includes "identical"
payload sent to different users, I believe these dupes should not be
eliminated, since that would bias results. A random sample will already
eliminate most duplicates while preserving the distribution.
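
To make the difference concrete, here is a minimal Python sketch. It is
not anything mass-check itself does; the feed, the campaign labels "A",
"B", "C" and both helper functions are made up for illustration. It
compares de-duplicating by payload against plain random sampling:

```python
import random
from collections import Counter


def dedupe(messages):
    """Keep one message per distinct payload (one possible meaning of
    "filter duplicates"; hypothetical helper)."""
    return list(dict.fromkeys(messages))


def random_sample(messages, k, seed=0):
    """Draw k messages uniformly at random, without replacement."""
    return random.Random(seed).sample(messages, k)


# Hypothetical trap feed: each letter stands for one identical payload
# that was delivered to many different recipients.
feed = ["A"] * 9000 + ["B"] * 900 + ["C"] * 100

deduped = dedupe(feed)
sampled = random_sample(feed, 1000)

# De-duplication flattens every campaign to a single message.
print(Counter(deduped))
# Sampling shrinks the volume but keeps the campaigns in roughly their
# original proportions.
print(Counter(sampled))
```

After de-duplication each campaign counts exactly once, no matter how
much of the real-world volume it made up. The random sample is just as
small as requested, yet the dominant campaign still dominates.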

Is there also ham?


Regarding skewing of results due to a single source with overwhelming
numbers: I recall days when mass-checks (though not for scoring)
basically consisted of one huge corpus and a bunch of additional,
*much* smaller corpora. It did indeed have an impact on quite a few
rules, which hardly matched the dominant corpus at all, though they
matched the others quite nicely. :/


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Mass-check Corpora (once was: Re: Update Mirror Issues)

Posted by "Warren Togami Jr." <wt...@gmail.com>.
On 2/2/2011 1:01 AM, Justin Mason wrote:
> 2011/2/2 Warren Togami Jr.<wt...@gmail.com>:
>> On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
>>>
>>> Yikes indeed.
>>>
>>> Maybe Joao should answer these himself...
>>>
>>> Given the numbers, is that purely trap driven? Is there a legion of human
>>> users manually verifying the spam?
>>>
>>> What exactly does "filter duplicates" mean? If that includes "identical"
>>> payload sent to different users, I believe these dupes should not be
>>> eliminated, since that would bias results. A random sample will already
>>> eliminate most duplicates while preserving the distribution.
>>
>> Good point. +1
>
> +1.
>
> My approach btw when dealing with traps is to (a) upload those using a
> distinct filename if possible (e.g. "ham-jm-traps.log" or similar),
> and (b) sample randomly to get the volume down to something comparable
> to the other corpora.  Trap spam tends to contain bounce blowback and
> other "noise" that we don't necessarily want in large numbers in our
> corpora.

Good point about bounce blowback (or backscatter as some people call 
it).  I forgot about that because my traps automatically filter that out 
from the corpus.

Warren

Re: Mass-check Corpora (once was: Re: Update Mirror Issues)

Posted by Justin Mason <jm...@jmason.org>.
2011/2/2 Warren Togami Jr. <wt...@gmail.com>:
> On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
>>
>> Yikes indeed.
>>
>> Maybe Joao should answer these himself...
>>
>> Given the numbers, is that purely trap driven? Is there a legion of human
>> users manually verifying the spam?
>>
>> What exactly does "filter duplicates" mean? If that includes "identical"
>> payload sent to different users, I believe these dupes should not be
>> eliminated, since that would bias results. A random sample will already
>> eliminate most duplicates while preserving the distribution.
>
> Good point. +1

+1.

My approach btw when dealing with traps is to (a) upload those using a
distinct filename if possible (e.g. "ham-jm-traps.log" or similar),
and (b) sample randomly to get the volume down to something comparable
to the other corpora.  Trap spam tends to contain bounce blowback and
other "noise" that we don't necessarily want in large numbers in our
corpora.
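
Step (b) could look roughly like the Python sketch below. The function
name, the peer corpus sizes, and the choice of the median as the
"comparable" target are all assumptions for illustration, not part of
the mass-check tooling:

```python
import random
import statistics


def downsample_to_peers(trap_messages, peer_sizes, seed=0):
    """Randomly cut a trap corpus down to the median size of the other
    corpora (hypothetical helper; the median is an assumed target)."""
    target = int(statistics.median(peer_sizes))
    if len(trap_messages) <= target:
        # Already comparable in size, nothing to drop.
        return list(trap_messages)
    return random.Random(seed).sample(trap_messages, target)


# Hypothetical numbers: a large trap feed next to three smaller corpora.
trap = [f"trap-{i}" for i in range(100_000)]
peer_sizes = [12_000, 20_000, 8_000]

subset = downsample_to_peers(trap, peer_sizes)
print(len(subset))  # 12000
```

The subset would then be uploaded under its own distinct filename as in
(a), so the trap results stay separable from hand-verified mail.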

Re: Mass-check Corpora (once was: Re: Update Mirror Issues)

Posted by "Warren Togami Jr." <wt...@gmail.com>.
On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
> Yikes indeed.
>
> Maybe Joao should answer these himself...
>
> Given the numbers, is that purely trap driven? Is there a legion of human
> users manually verifying the spam?
>
> What exactly does "filter duplicates" mean? If that includes "identical"
> payload sent to different users, I believe these dupes should not be
> eliminated, since that would bias results. A random sample will already
> eliminate most duplicates while preserving the distribution.

Good point. +1

Warren