Posted to ruleqa@spamassassin.apache.org by da...@chaosreigns.com on 2014/05/01 23:20:21 UTC

Corpus cleaning - keeping a list of already verified non-spams

To make corpus cleaning easier next time, you can save a list of
high-scoring emails that turned out not to be spam, so they are skipped
automatically. When you view emails as suggested in
https://wiki.apache.org/spamassassin/CorpusCleaning
each one has an "X-Mass-Check-Id:" header naming the file it came from.
Use that to remove any email that really was spam from the id.hi file.
Then copy the id.hi file to something like ~/sa/id.hi.good
and next time run:

sort -rn -k 2 ham.log | fgrep -vf ~/sa/id.hi.good | head -n 200 > id.hi
./mboxget < id.hi > mbox
mutt -f mbox
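
One caveat with the fgrep approach: id.hi.good holds whole log lines, so an
entry stops matching once that message's score or rule hits change in a
later run. Below is a minimal sketch of a variant that remembers only the
message path and matches it exactly. It assumes the path is the third
whitespace-separated field of each mass-check log line and contains no
spaces; the function names and the skip-list file are hypothetical, not part
of the wiki instructions.

```shell
#!/bin/sh
# Sketch only: assumes log lines look like "<result> <score> <path> <rules...>".

# remember_verified VERIFIED_LINES SKIP_LIST
# Add the path column of every verified line to the skip list, without duplicates.
remember_verified() {
    awk '{ print $3 }' "$1" >> "$2"
    sort -u -o "$2" "$2"
}

# next_candidates HAM_LOG SKIP_LIST COUNT
# Print the COUNT highest-scoring result lines whose path is not in the skip list.
# A missing skip list is treated as empty.
next_candidates() {
    sort -rn -k 2 "$1" |
        awk -v skip="$2" '
            BEGIN { while ((getline line < skip) > 0) seen[line] = 1 }
            /^[.Y]/ && !($3 in seen)
        ' |
        head -n "$3"
}
```

After trimming the real spam out of id.hi, usage would be along the lines of
"remember_verified id.hi ~/sa/id.hi.paths", and next time
"next_candidates ham.log ~/sa/id.hi.paths 200 > id.hi".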

Added as:
https://wiki.apache.org/spamassassin/CorpusCleaning#Saving_a_list_of_verified_non-spams

(Obviously, you can do the same for low-scoring spam, i.e. false negatives.)


I remain concerned that people may not be doing this enough (while
recognizing it's irrelevant as long as rsync.spamassassin.org is down).

-- 
"...The people who are crazy enough to think they can change the world,
are the ones who do."  - Steve Jobs
http://www.ChaosReigns.com

Re: Corpus cleaning - keeping a list of already verified non-spams

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
Nice.  I like it.  And I've spent a lot of time getting rsync back up 
and working.  I'll send an update in a moment.




Re: Corpus cleaning - keeping a list of already verified non-spams

Posted by Kevin Golding <kp...@caomhin.org>.
On Thu, 01 May 2014 22:20:21 +0100, <da...@chaosreigns.com> wrote:

> I remain concerned that people may not be doing this enough (while
> recognizing it's irrelevant as long as rsync.spamassassin.org is down).

I must admit I probably don't do this enough, so I figured it made sense to
make it a weekly task, using the net checks so that it is based on the most
complete tests we run.

Then, on a test run, I skimmed the samples and realised that they looked
correctly sorted, and that logically, if I run it weekly, I should expect
to see only a handful of messages change: either a few new ones added to
the corpus, or a few old ones drifting out to let others in. I suspect the
extremes are good for a periodic check, but if I'm going to do it regularly
I think a random spot check will make more difference.

So this is now spotcheck.sh and runs Saturday evenings for me:

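#!/bin/sh

cd /home/masscheck/masscheckwork/weekly_mass_check/masses || exit 1

# 100 lowest-scoring spam, 100 highest-scoring ham, plus 100 random lines
# from each log. "random -f" is the BSD random(6) utility, which outputs
# the file's lines in random order.
sort -n -k 2 spam-net-kpg-core.log | head -n 100 > /home/masscheck/reports/low.spam
random -f spam-net-kpg-core.log | head -n 100 > /home/masscheck/reports/rand.spam
sort -rn -k 2 ham-net-kpg-core.log | head -n 100 > /home/masscheck/reports/high.ham
random -f ham-net-kpg-core.log | head -n 100 > /home/masscheck/reports/rand.ham

cd /home/masscheck/reports/ || exit 1

# Keep only result lines (they start with "." or "Y"), which drops the
# log's header lines, and collect the message path (field 3).
grep -e '^[.Y]' low.spam | awk '{ print $3 }' > verify.todo
grep -e '^[.Y]' rand.spam | awk '{ print $3 }' >> verify.todo
grep -e '^[.Y]' high.ham | awk '{ print $3 }' >> verify.todo
grep -e '^[.Y]' rand.ham | awk '{ print $3 }' >> verify.todo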


That gives me the 100 most extreme samples from each log (not quite 100,
since the log's header lines turn up in the low spam search, hence the grep
filter, which I applied to all four reports to be safe) plus 100 random
samples each of ham and spam.
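
A sketch of the same sampling with the header lines filtered out before head
runs, so each list holds a full 100 messages whenever the log has that many.
It assumes GNU coreutils' shuf is available in place of BSD random, and the
function names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: result lines are assumed to start with "." or "Y",
# with the score in field 2 and the message path in field 3.

# lowest_paths LOG COUNT: paths of the COUNT lowest-scoring messages.
lowest_paths() {
    grep '^[.Y]' "$1" | sort -n -k 2 | head -n "$2" | awk '{ print $3 }'
}

# highest_paths LOG COUNT: paths of the COUNT highest-scoring messages.
highest_paths() {
    grep '^[.Y]' "$1" | sort -rn -k 2 | head -n "$2" | awk '{ print $3 }'
}

# random_paths LOG COUNT: paths of up to COUNT messages picked at random
# (shuf is from GNU coreutils).
random_paths() {
    grep '^[.Y]' "$1" | awk '{ print $3 }' | shuf -n "$2"
}
```

With these, verify.todo could be built by redirecting the combined output of
"lowest_paths spam-net-kpg-core.log 100", "random_paths spam-net-kpg-core.log 100",
"highest_paths ham-net-kpg-core.log 100" and "random_paths ham-net-kpg-core.log 100".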

The logic is that I then dump every file listed in verify.todo back into a
verification folder and sort it with my usual sifting. By combining the
lists and including some variety from the random samples, I figure I'm more
likely to look carefully at the messages before reintroducing them to my
corpus (I know that personally I stand a high risk of going "Yeah, I rock,
back you go" after just a quick skim if I split the ham and spam).
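
For anyone wanting to copy the idea, here is a rough sketch of the "dump into
a verification folder" step. It is an assumption about how that could be
done, not the script actually used here: it copies rather than moves, it
assumes one message per file, and the function name and folder are
hypothetical.

```shell
#!/bin/sh
# copy_for_review TODO_LIST DEST_DIR
# Copy every file listed in TODO_LIST (one path per line) into DEST_DIR.
# Duplicate entries are copied once and missing files are reported and skipped.
# Files that share a basename would overwrite each other, so this only
# suits corpora with unique file names.
copy_for_review() {
    mkdir -p "$2" || return 1
    sort -u "$1" | while IFS= read -r path; do
        if [ -f "$path" ]; then
            cp -p "$path" "$2/"
        else
            echo "skipping missing file: $path" >&2
        fi
    done
}
```

Usage would be something like "copy_for_review verify.todo /home/masscheck/verify",
where the destination path is only an example.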