Posted to dev@spamassassin.apache.org by Michael Parker <pa...@pobox.com> on 2012/07/01 05:23:59 UTC

Re: "jarif" corpus on Spamassassin masschecks

On Jun 30, 2012, at 5:56 PM, Jari Fredriksson wrote:

> On 30.6.2012 21:32, João Gouveia wrote:
>> Hi Jarif,
>> 
>> Are you the owner of the "jarif" corpus being used on the Spamassassin
>> masschecks?
>> If so, I'm interested in investigating these classification errors:
>> 
>> http://ruleqa.spamassassin.org/20120630-r1355665-n/RCVD_IN_MSPIKE_BL?mclog=ham-net-jarif
>> 
>> I own and operate MailSpike, and naturally I'm a bit concerned about
>> these false positives.
>> Would it be possible to know the list of IP addresses that caused so
>> many false positives?
>> 
>> Thanks in advance!
>> 
> 
> I had false alarms in my corpus, thanks for posting me this query!
> 
> 1. They were mostly old mails from WorldOfTanks.eu and Facebook.com.
> They did trigger apparently RCVD_IN_MSPIKE_BL in 2011 and early this year.
> 
> 2. None of them trigger it now.
> 
> 3. I have to remove old SpamAssassin traces from all of my corpus. I
> had thought that SA does it automatically when doing masscheck, but I
> was wrong! I even asked about it in SA dev mailing list, but got no
> answer and made a bad decision to leave the markup to the files.
> 

You shouldn't remove the old SA hits from your corpus.  Those hits are used by reuse rules and are critical for proper accuracy of some rules.

Michael
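
Before a masscheck, one way to confirm the old hits are still present is to count the messages that carry SpamAssassin markup. A minimal sketch, assuming the earlier hits are recorded in an X-Spam-Status header and that the corpus is a directory of one-message files; the function name is hypothetical:

```shell
# Hypothetical helper: count messages whose header block contains an
# X-Spam-Status line (assumption: the old SA hits are stored there).
count_with_sa_markup() {
    dir=$1
    total=0
    marked=0
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        total=$((total + 1))
        # sed stops at the first empty line, so only the header block is checked.
        if sed '/^$/q' "$file" | grep -q -i '^X-Spam-Status:'; then
            marked=$((marked + 1))
        fi
    done
    echo "$marked of $total messages carry X-Spam-Status"
}
```

For example: count_with_sa_markup Maildir/.Confirmed-HAM/cur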

> No worries, I take corrective action now.
> 
> Thanks, jarif
> 
> ps.
> 
> They seem to trigger still
> 
>   3.5 FROM_12LTRDOM From a 12-letter domain
> 
> Where did that rule come from? Really? 12 letters in a domain, and it gets 3.5
> points??
> 
> Received: from wot-slave-54.worldoftanks.ru ([213.252.131.54])
>        by ikiaikainen.iki.fi (8.14.4/8.14.4) with ESMTP id p0NIObHR027573
>        for <ja...@iki.fi>; Sun, 23 Jan 2011 20:24:37 +0200 (EET)
> Received: by wot-slave-54.worldoftanks.ru (Postfix, from userid 101)
>        id 32686BB83CC; Sun, 23 Jan 2011 18:24:32 +0000 (UTC)
> 
> 
> It is  worldoftanks.eu and worldoftanks.ru triggering this strange rule.
> 
> All mail from Facebook gets negative points only.
> 
> -- 
> 
> Tomorrow will be cancelled due to lack of interest.
> 	
> 
>

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On Mon, July 2, 2012 09:37, Jari Fredriksson wrote:
>> function remove-unwanted-mail
>> {
>>     echo "$0: removing unwanted $1 mail from corpus"
>>     for file in `egrep -l -m 1
>> "^List-id\:|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To:
>> washingtonpost@fred.*\.fi|^From\: .*MAILER-DAEMON" \
>>                   Maildir/.Confirmed-$1/cur/*`
>>     do
>>       if test -f "$file"; then
>>         echo -n "Removing $file... "
>>         rm "$file" || exit 1
>>         echo "done."
>>       fi
>>     done
>>
>>     for file in `grep ALL_TRUSTED
>> masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
>>     do
>>       if test -f "$file"; then
>>         echo -n "Removing $file... "
>>         rm "$file" || exit 1
>>         echo "done."
>>       fi
>>     done
>> }
>>
>> remove-unwanted-mail HAM
>> remove-unwanted-mail SPAM
>>
>>
>
> This is now running always before the masscheck. It ruins the idea of
> Warren, who urged me to order and collect Finnish ham mail from news
> agencies and such, trying to grab a sample of what Finnish email users
> get into their inbox.
>
> There is still this kind of mass email, which is not personal to me:
> railroad (vr.fi) air finnair.(fi|com) to name a couple. Lots of stuff
> will be removed from news agencies like talentum.com, hs.fi,
> iltasanomat.fi. Lots of mail that is not personal. Spirit of wiki page.
>

Too aggressive filter on ^From:
Too aggressive filter on Finnish mail.

Fixed and restored from backup the corpus.

    filter="^List-id\:"
    filter="$filter|^Received\:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
    filter="$filter|^From:.*(MAILER-DAEMON|nytdirect\@nytimes\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
    for file in `egrep -l -m 1 "$filter" Maildir/.Confirmed-$1/cur/*`
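
Because the first version removed far more than intended, a dry run that only lists the matches can be useful before anything is deleted. A minimal sketch reusing the filter above; list_unwanted_mail is a hypothetical name, grep -E stands in for egrep, and the backslashes before ":" and "@" are dropped because neither character is special in an extended regex:

```shell
# Hypothetical dry-run helper: print the files the filter would match.
list_unwanted_mail() {
    dir=$1
    filter='^List-id:'
    filter="$filter|^Received:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
    filter="$filter|^From:.*(MAILER-DAEMON|nytdirect@nytimes\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
    # -l prints only the names of matching files; nothing is removed.
    grep -E -l -- "$filter" "$dir"/* 2>/dev/null
}
```

For example: list_unwanted_mail Maildir/.Confirmed-HAM/cur | less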

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 2.7.2012 9:00, Jari Fredriksson wrote:
> On 2.7.2012 5:27, John Hardin wrote:
>> On Sun, 1 Jul 2012, darxus@chaosreigns.com wrote:
>>
>>> On 07/01, Jari Fredriksson wrote:
>>>> Did re-read wiki about cleaning corpus, and removed all mail from
>>>> Facebook
>>>> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
>>>> ALL_TRUSTED removed.
>>>
>>> I wouldn't remove the facebook stuff... linkedin seems kind of evil
>>> though.
>>> But if you got a legit email from facebook, and it hit a blacklist, that
>>> was a legit failure of that blacklist, and valuable information.
>>> Especially since things like sought have a bad habit of inappropriately
>>> causing stuff from facebook to get flagged as spam.
>>>
>>> Removing MAILER-DAEMON and ALL_TRUSTED stuff is probably fine.
>>
>> I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis
>> and prevention, and having an ALL_TRUSTED spam is equally valuable.
>> ALL_TRUSTED means "not forged", not "not spam".
>>
> 
> I follow the wiki page. I have now implemented the following
> 
> function remove-unwanted-mail
> {
>     echo "$0: removing unwanted $1 mail from corpus"
>     for file in `egrep -l -m 1
> "^List-id\:|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To:
> washingtonpost@fred.*\.fi|^From\: .*MAILER-DAEMON" \
>                   Maildir/.Confirmed-$1/cur/*`
>     do
>       if test -f "$file"; then
>         echo -n "Removing $file... "
>         rm "$file" || exit 1
>         echo "done."
>       fi
>     done
> 
>     for file in `grep ALL_TRUSTED
> masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
>     do
>       if test -f "$file"; then
>         echo -n "Removing $file... "
>         rm "$file" || exit 1
>         echo "done."
>       fi
>     done
> }
> 
> remove-unwanted-mail HAM
> remove-unwanted-mail SPAM
> 
> 

This is now running always before the masscheck. It ruins the idea of
Warren, who urged me to order and collect Finnish ham mail from news
agencies and such, trying to grab a sample of what Finnish email users
get into their inbox.

There is still this kind of mass email, which is not personal to me:
railroad (vr.fi) air finnair.(fi|com) to name a couple. Lots of stuff
will be removed from news agencies like talentum.com, hs.fi,
iltasanomat.fi. Lots of mail that is not personal. Spirit of wiki page.

-- 

You will forget that you ever knew me.

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 3.7.2012 2:24, darxus@chaosreigns.com wrote:
> On 07/02, RW wrote:
>> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
>> John Hardin wrote:
>>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>>
>>> That says to not include any _spams_ received via those channels, not
>>> to discard them _in toto_.
>>>
>> It actually says:
>>
>>
>> DO NOT include such mail in either ham or spam folder. Just delete it.
>> Why? We don't want to count these as spam, causing false marks against
>> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
>> as ham either, because spam URL's or spam text would throw off the
>> statistics if they show up in the ham folder. Simply delete them
> 
> Jari had been deleting non-spam from facebook.  As John said, that wiki
> page says to not include *spam* from places like facebook.  Legit mail
> from facebook, which Jari had been deleting, has value when appropriately
> reported as non-spam.
> 

My so far finalized version of the script deletes only 2 HAMs now from
the whole corpus.

bin/delete-unwanted-mail.sh: removing unwanted HAM mail from corpus
Removing
Maildir/.Confirmed-HAM/cur/1325191790.M834211P3551V000000000000FE00I000000000007650B_0.hurricane,S=8885:2,S...
done.
Removing
Maildir/.Confirmed-HAM/cur/1333374426.M539856P18381V000000000000FE00I00000000000605A5_4.hurricane,S=6426:2,S...
done.
bin/delete-unwanted-mail.sh: removing unwanted SPAM mail from corpus

Those were not really bad ham, but they contained ^List-Id AND
^Received:.*MAILER-DAEMON in an attachment. I do not bother to do
something about those, they are rare examples of HAM. Sent by ezmail
from Debian because I had something wrong in my server and they tried to
send list post to me.

Only two deleted.
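
Those two hits came from text inside an attachment, because the pattern is applied to the whole file. One way to avoid that is to test only the header block, i.e. everything before the first blank line. A minimal sketch; header_matches is a hypothetical helper, not part of the script discussed here, and it does not unfold continuation lines:

```shell
# Hypothetical helper: succeed only if the pattern matches a header line.
header_matches() {
    pattern=$1
    file=$2
    # sed quits at the first empty line, so body and attachments are never seen.
    sed '/^$/q' "$file" | grep -E -q -- "$pattern"
}
```

In the removal loop it could be used as: if header_matches "$filter" "$file"; then rm "$file"; fi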

-- 

Among the lucky, you are the chosen one.

Re: "jarif" corpus on Spamassassin masschecks

Posted by da...@chaosreigns.com.

On 07/02, RW wrote:
> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
> John Hardin wrote:
> > On Mon, 2 Jul 2012, Jari Fredriksson wrote:
> > > http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
> > 
> > That says to not include any _spams_ received via those channels, not
> > to discard them _in toto_.
> > 
> It actually says:
> 
> 
> DO NOT include such mail in either ham or spam folder. Just delete it.
> Why? We don't want to count these as spam, causing false marks against
> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
> as ham either, because spam URL's or spam text would throw off the
> statistics if they show up in the ham folder. Simply delete them

Jari had been deleting non-spam from facebook.  As John said, that wiki
page says to not include *spam* from places like facebook.  Legit mail
from facebook, which Jari had been deleting, has value when appropriately
reported as non-spam.

-- 
"Whom God wishes to destroy, he first makes mad."
- Euripides (c.480 - 406 BC).
http://www.ChaosReigns.com

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On Tue, July 3, 2012 00:59, Jari Fredriksson wrote:
> On Mon, July 2, 2012 22:57, Jari Fredriksson wrote:
>> On 2.7.2012 22:01, John Hardin wrote:
>>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>>
>>>> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>>>>> On 07/02, Jari Fredriksson wrote:
>>>>>> I follow the wiki page. I have now implemented the following
>>>>>
>>>>> It seems you are interpreting the wiki as a flawless authority, when
>>>>> it
>>>>> would probably be more appropriate to consider it a crufty guideline
>>>>> that
>>>>> one of us should get around to updating.
>>>>>
>>>>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>>>>
>>>>> Which part of that page made you feel you should strip out facebook?
>>>>>
>>>>
>>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>>>
>>>
>>> That says to not include any _spams_ received via those channels, not
>>> to
>>> discard them _in toto_.
>>
>> Thanks! A good catch I guess, thanks for pointing out this failure in my
>> reading comprehension. But those spammy messages which opened this very
>> thread will be still removed, as they were in HAM corpus and rightly so.
>>
>
> It's actually quite hard to remove those from SPAM, as they may and will
> have a forged linkedin.com or facebook.com Received-header.
>
> I have to manually check the damn spam really carefully.
>
>

spamassassin seems to trigger DKIM_VALID_AU all right for Linkedin, but
says dkim is invalid for Facebook.

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On Mon, July 2, 2012 22:57, Jari Fredriksson wrote:
> On 2.7.2012 22:01, John Hardin wrote:
>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>
>>> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>>>> On 07/02, Jari Fredriksson wrote:
>>>>> I follow the wiki page. I have now implemented the following
>>>>
>>>> It seems you are interpreting the wiki as a flawless authority, when
>>>> it
>>>> would probably be more appropriate to consider it a crufty guideline
>>>> that
>>>> one of us should get around to updating.
>>>>
>>>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>>>
>>>> Which part of that page made you feel you should strip out facebook?
>>>>
>>>
>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>>
>>
>> That says to not include any _spams_ received via those channels, not to
>> discard them _in toto_.
>
> Thanks! A good catch I guess, thanks for pointing out this failure in my
> reading comprehension. But those spammy messages which opened this very
> thread will be still removed, as they were in HAM corpus and rightly so.
>

It's actually quite hard to remove those from SPAM, as they may and will
have a forged linkedin.com or facebook.com Received-header.

I have to manually check the damn spam really carefully.

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 2.7.2012 22:01, John Hardin wrote:
> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
> 
>> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>>> On 07/02, Jari Fredriksson wrote:
>>>> I follow the wiki page. I have now implemented the following
>>>
>>> It seems you are interpreting the wiki as a flawless authority, when it
>>> would probably be more appropriate to consider it a crufty guideline
>>> that
>>> one of us should get around to updating.
>>>
>>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>>
>>> Which part of that page made you feel you should strip out facebook?
>>>
>>
>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>
> 
> That says to not include any _spams_ received via those channels, not to
> discard them _in toto_.

Thanks! A good catch I guess, thanks for pointing out this failure in my
reading comprehension. But those spammy messages which opened this very
thread will be still removed, as they were in HAM corpus and rightly so.

-- 

Q:	How does a hacker fix a function which
	doesn't work for all of the elements in its domain?
A:	He changes the domain.

Re: "jarif" corpus on Spamassassin masschecks

Posted by John Hardin <jh...@impsec.org>.

On Mon, 2 Jul 2012, RW wrote:

> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
> John Hardin wrote:
>
>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>
>>> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>>>> On 07/02, Jari Fredriksson wrote:
>>>>> I follow the wiki page. I have now implemented the following
>>>>
>>>> It seems you are interpreting the wiki as a flawless authority,
>>>> when it would probably be more appropriate to consider it a crufty
>>>> guideline that one of us should get around to updating.
>>>>
>>>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>>>
>>>> Which part of that page made you feel you should strip out
>>>> facebook?
>>>>
>>>
>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>
>> That says to not include any _spams_ received via those channels, not
>> to discard them _in toto_.
>
> It actually says:
>
>
> DO NOT include such mail in either ham or spam folder. Just delete it.
> Why? We don't want to count these as spam, causing false marks against
> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
> as ham either, because spam URL's or spam text would throw off the
> statistics if they show up in the ham folder. Simply delete them

That is in reference to _spams_ received via facebook et. al., NOT to 
legitimate ham messages received from them.

The complete context makes that clear:

   * Spam Sent via Legitimate Services (Facebook, Livejournal, etc.)
   Occasionally you receive spam text posted to your account on
   services like LiveJournal or Facebook. {followed by the above quote}

Vetted ham messages from facebook, linkedin, livejournal, etc. are 
acceptable and should not be excluded simply because of their source.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Riff: Torg, you traded our magic beans for a _cow_?
   Torg: It's a _magic_ cow! It's full of steaks!
   Riff: Whoa!			                 -- Sluggy 04/28/2002
-----------------------------------------------------------------------
  2 days until the 236th anniversary of the Declaration of Independence

Re: "jarif" corpus on Spamassassin masschecks

Posted by John Hardin <jh...@impsec.org>.

On Mon, 2 Jul 2012, RW wrote:

> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
> John Hardin wrote:
>
>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>
>>> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>>>> On 07/02, Jari Fredriksson wrote:
>>>>> I follow the wiki page. I have now implemented the following
>>>>
>>>> It seems you are interpreting the wiki as a flawless authority,
>>>> when it would probably be more appropriate to consider it a crufty
>>>> guideline that one of us should get around to updating.
>>>>
>>>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>>>
>>>> Which part of that page made you feel you should strip out
>>>> facebook?
>>>>
>>>
>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>
>> That says to not include any _spams_ received via those channels, not
>> to discard them _in toto_.
>>
> It actually says:
>
>
> DO NOT include such mail in either ham or spam folder. Just delete it.
> Why? We don't want to count these as spam, causing false marks against
> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
> as ham either, because spam URL's or spam text would throw off the
> statistics if they show up in the ham folder. Simply delete them

Also, by "discard them in toto" I was referring to the _channels_, not the 
individual messages.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Riff: Torg, you traded our magic beans for a _cow_?
   Torg: It's a _magic_ cow! It's full of steaks!
   Riff: Whoa!			                 -- Sluggy 04/28/2002
-----------------------------------------------------------------------
  2 days until the 236th anniversary of the Declaration of Independence

Re: "jarif" corpus on Spamassassin masschecks

Posted by RW <rw...@googlemail.com>.

On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
John Hardin wrote:

> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
> 
> > On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
> >> On 07/02, Jari Fredriksson wrote:
> >>> I follow the wiki page. I have now implemented the following
> >>
> >> It seems you are interpreting the wiki as a flawless authority,
> >> when it would probably be more appropriate to consider it a crufty
> >> guideline that one of us should get around to updating.
> >>
> >> http://wiki.apache.org/spamassassin/CorpusCleaning
> >>
> >> Which part of that page made you feel you should strip out
> >> facebook?
> >>
> >
> > http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
> 
> That says to not include any _spams_ received via those channels, not
> to discard them _in toto_.
> 
It actually says:


DO NOT include such mail in either ham or spam folder. Just delete it.
Why? We don't want to count these as spam, causing false marks against
highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
as ham either, because spam URL's or spam text would throw off the
statistics if they show up in the ham folder. Simply delete them

Re: "jarif" corpus on Spamassassin masschecks

Posted by John Hardin <jh...@impsec.org>.

On Mon, 2 Jul 2012, Jari Fredriksson wrote:

> On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
>> On 07/02, Jari Fredriksson wrote:
>>> I follow the wiki page. I have now implemented the following
>>
>> It seems you are interpreting the wiki as a flawless authority, when it
>> would probably be more appropriate to consider it a crufty guideline that
>> one of us should get around to updating.
>>
>> http://wiki.apache.org/spamassassin/CorpusCleaning
>>
>> Which part of that page made you feel you should strip out facebook?
>>
>
> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29

That says to not include any _spams_ received via those channels, not to 
discard them _in toto_.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Taking my gun away because I *might* shoot someone is like cutting
   my tongue out because I *might* yell "Fire!" in a crowded theater.
                                                   -- Peter Venetoklis
-----------------------------------------------------------------------
  2 days until the 236th anniversary of the Declaration of Independence

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 2.7.2012 19:23, darxus@chaosreigns.com wrote:
> On 07/02, Jari Fredriksson wrote:
>> I follow the wiki page. I have now implemented the following
> 
> It seems you are interpreting the wiki as a flawless authority, when it
> would probably be more appropriate to consider it a crufty guideline that
> one of us should get around to updating.
> 
> http://wiki.apache.org/spamassassin/CorpusCleaning
> 
> Which part of that page made you feel you should strip out facebook?
> 

http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29


-- 

You're ugly and your mother dresses you funny.

Re: "jarif" corpus on Spamassassin masschecks

Posted by da...@chaosreigns.com.

On 07/02, Jari Fredriksson wrote:
> I follow the wiki page. I have now implemented the following

It seems you are interpreting the wiki as a flawless authority, when it
would probably be more appropriate to consider it a crufty guideline that
one of us should get around to updating.

http://wiki.apache.org/spamassassin/CorpusCleaning

Which part of that page made you feel you should strip out facebook?

-- 
"theres a lot more to life than chicks
none of it matters but theres a lot of it"
- LeRoy, #motorcycles, #EFNet, 7/18/06
http://www.ChaosReigns.com

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 2.7.2012 5:27, John Hardin wrote:
> On Sun, 1 Jul 2012, darxus@chaosreigns.com wrote:
> 
>> On 07/01, Jari Fredriksson wrote:
>>> Did re-read wiki about cleaning corpus, and removed all mail from
>>> Facebook
>>> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
>>> ALL_TRUSTED removed.
>>
>> I wouldn't remove the facebook stuff... linkedin seems kind of evil
>> though.
>> But if you got a legit email from facebook, and it hit a blacklist, that
>> was a legit failure of that blacklist, and valuable information.
>> Especially since things like sought have a bad habit of inappropriately
>> causing stuff from facebook to get flagged as spam.
>>
>> Removing MAILER-DAEMON and ALL_TRUSTED stuff is probably fine.
> 
> I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis
> and prevention, and having an ALL_TRUSTED spam is equally valuable.
> ALL_TRUSTED means "not forged", not "not spam".
> 

I follow the wiki page. I have now implemented the following

function remove-unwanted-mail
{
    echo "$0: removing unwanted $1 mail from corpus"
    filter="^List-id\:"
    filter="$filter|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
    filter="$filter|^From\: .*MAILER-DAEMON"
    for file in `egrep -l -m 1 "$filter" Maildir/.Confirmed-$1/cur/*`
    do
      if test -f "$file"; then
        echo -n "Removing $file... "
        rm "$file" || exit 1
        echo "done."
      fi
    done

    for file in `grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
    do
      if test -f "$file"; then
        echo -n "Removing $file... "
        rm "$file" || exit 1
        echo "done."
      fi
    done
}

remove-unwanted-mail HAM
remove-unwanted-mail SPAM


-- 

Your analyst has you mixed up with another patient.  Don't believe a
thing he tells you.

Re: "jarif" corpus on Spamassassin masschecks

Posted by John Hardin <jh...@impsec.org>.

On Sun, 1 Jul 2012, darxus@chaosreigns.com wrote:

> On 07/01, Jari Fredriksson wrote:
>> Did re-read wiki about cleaning corpus, and removed all mail from Facebook
>> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
>> ALL_TRUSTED removed.
>
> I wouldn't remove the facebook stuff... linkedin seems kind of evil though.
> But if you got a legit email from facebook, and it hit a blacklist, that
> was a legit failure of that blacklist, and valuable information.
> Especially since things like sought have a bad habit of inappropriately
> causing stuff from facebook to get flagged as spam.
>
> Removing MAILER-DAEMON and ALL_TRUSTED stuff is probably fine.

I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis and 
prevention, and having an ALL_TRUSTED spam is equally valuable. 
ALL_TRUSTED means "not forged", not "not spam".
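
To inspect the ALL_TRUSTED messages rather than delete them, the log lookup can simply print their paths. A minimal sketch, assuming (as the removal script in this thread does) that the message path is the third whitespace-separated field of each mass-check log line; the function name is hypothetical:

```shell
# Hypothetical helper: list the messages that hit ALL_TRUSTED.
list_all_trusted() {
    # Print the third field (assumed to be the message path) of every
    # log line that mentions ALL_TRUSTED, without duplicates.
    awk '/ALL_TRUSTED/ { print $3 }' "$@" | sort -u
}
```

For example: list_all_trusted masscheckwork/*_mass_check/masses/*am-jarif.log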

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Justice is justice, whereas "social justice" is code for one set
   of rules for the rich, another for the poor; one set for whites,
   another set for minorities; one set for straight men, another for
   women and gays. In short, it's the opposite of actual justice.
                                                     -- Burt Prelutsky
-----------------------------------------------------------------------
  3 days until the 236th anniversary of the Declaration of Independence

Re: "jarif" corpus on Spamassassin masschecks

Posted by da...@chaosreigns.com.

On 07/01, Jari Fredriksson wrote:
> Did re-read wiki about cleaning corpus, and removed all mail from Facebook
> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
> ALL_TRUSTED removed.

I wouldn't remove the facebook stuff... linkedin seems kind of evil though.
But if you got a legit email from facebook, and it hit a blacklist, that
was a legit failure of that blacklist, and valuable information.
Especially since things like sought have a bad habit of inappropriately
causing stuff from facebook to get flagged as spam.  

Removing MAILER-DAEMON and ALL_TRUSTED stuff is probably fine.

-- 
"Everything is sacred to us...so if you are sacred then you must treat
yourself with respect, to do otherwise is to desecrate something that
is holy." - ST:TNG 7x20 Journey's End
http://www.ChaosReigns.com

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On Sun, July 1, 2012 06:48, Jari Fredriksson wrote:
>> You shouldn't remove the old SA hits from your corpus.  Those hits are
>> used by reuse rules and are critical for proper accuracy of some rules.
>>
>> Michael
>>
>
> OK. So I keep them and post the requested IP:s to João
>

Did re-read wiki about cleaning corpus, and removed all mail from Facebook
and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
ALL_TRUSTED removed.

Re: "jarif" corpus on Spamassassin masschecks

Posted by Jari Fredriksson <ja...@iki.fi>.

On 1.7.2012 6:23, Michael Parker wrote:
>> I had false alarms in my corpus, thanks for posting me this query!
>> > 
>> > 1. They were mostly old mails from WorldOfTanks.eu and Facebook.com.
>> > They did trigger apparently RCVD_IN_MSPIKE_BL in 2011 and early this year.
>> > 
>> > 2. None of them trigger it now.
>> > 
>> > 3. I have to remove old SpamAssassin traces from all of my corpus. I
>> > had thought that SA does it automatically when doing masscheck, but I
>> > was wrong! I even asked about it in SA dev mailing list, but got no
>> > answer and made a bad decision to leave the markup to the files.
>> > 
> You shouldn't remove the old SA hits from your corpus.  Those hits are used by reuse rules and are critical for proper accuracy of some rules.
> 
> Michael
> 

OK. So I keep them and post the requested IP:s to João

But RCVD_IN_MSPIKE does not trigger with current tests anyway.

-- 

I have never let my schooling interfere with my education.
		-- Mark Twain