Posted to users@spamassassin.apache.org by Gregory Zornetzer <ga...@nmrfam.wisc.edu> on 2004/09/24 23:29:18 UTC

pine folder internal data and sa-learn

Hi all,

I recently installed SpamAssassin 3.0.0 onto my unix account on an SGI IRIX 6.5
box.  I'm using perl 5.8.5, and I generally read my email with pine,
though sometimes I'll remotely view it using Evolution through the
machine's IMAP server.

The following is a portion of my .procmailrc file that is used for
spamassassin filtering of my email:

:0fw: spamassassin.lock
* < 80000
| spamassassin

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
mail/spam-definitely

:0:
* ^X-Spam-Status: Yes
mail/spam-probably


I have noticed that the mail that gets into the spam-probably folder
generally doesn't get autolearned by spamassassin.  Also, I've noticed
one message that snuck through the spam filter (it only got a score of 3,
and I haven't gotten enough spams trained in the Bayesian filter to
activate it.)  I would like to train the Bayesian filter with these
messages, so using pine, I put them in a mail folder called spam, and I
run sa-learn on it as follows:
sa-learn --spam --mbox --showdots mail/spam

Generally, I notice that sa-learn processes exactly one more message than
I thought was in the folder.  When I take a look in the folder with a text
editor, I see that there's a fake message that reads as follows:
---------
>From MAILER-DAEMON Tue Dec  9 23:05:26 2003
Date: Tue, 9 Dec 2003 23:05:26 -0600
From: Mail System Internal Data <MA...@nmrfam.wisc.edu>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
X-IMAP: 0945113015 0000000396
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message.  It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.
---------
I am worried that the Bayesian filter is learning this
folder-internal-data message as spam and that this may skew the results of
the filter in the future.  Note that the folder-internal data message
appears to change when the mailbox is changed, so each time I run
sa-learn, the message will get learned again, and not simply passed over
as an already-learned message.

I've found some other people have asked a similar question in the past,
but I didn't see any good answers to it.  Should I submit a bugzilla
report on this?  Any scripts to automagically strip out this message from
an MBOX file?

Thanks very much,
Greg Zornetzer
gaz at nmrfam dot wisc dot edu


Re: pine folder internal data and sa-learn

Posted by Theodore Heise <th...@heise.nu>.

On Sat, 25 Sep 2004, Theodore Heise wrote:

> I've been pointing sa-learn at Pine mail folders now for over two
> years, and just ignoring the fact it's learning from the Pine folder
> header.  I don't expect to actually get any e-mail resembling it.
> During this time Bayes has always worked very effectively for me.

Well, it occurred to me I could investigate this situation a little
bit more objectively using "spamassassin -t" (test mode).

I typically keep all my Pine mail folders in /home/theo/mail/, with
tagged spam directed to ~/mail/spam.  To train my Bayes, I point
sa-learn at the spam folder, move all tagged spam to an archive
file, and then learn ~/mail/* as ham.  This means the Pine spam
folder header gets looked at first as spam, and then as ham.

I tested the spam folder message after learning as spam, and then
after learning as ham.  I also tested the Pine message heading up
the ~/mail/sent folder.  All three messages hit on the same rules
and gave the same total score (quoted below).  Interestingly, the
actual BAYES_00 score was different for the ~/mail/spam folder
learned as spam, as compared to learned as ham (0.0012 vs. 0.0000).

The difference doesn't seem to be worth the trouble to bother with.

Ted

-- 
Theodore (Ted) Heise     <th...@heise.nu>     Bloomington, IN, USA


[~/mail/spam] Pine header message learned as spam
Content analysis details:   (-5.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.7 SUBJ_ALL_CAPS          Subject is all capitals
-3.3 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.1 MISSING_HEADERS        Missing To: header
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0012]


[~/mail/spam] Pine header message learned as ham
Content analysis details:   (-5.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.7 SUBJ_ALL_CAPS          Subject is all capitals
-3.3 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.1 MISSING_HEADERS        Missing To: header
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0000]

[~/mail/sent] Pine header message learned as ham
Content analysis details:   (-5.1 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.7 SUBJ_ALL_CAPS          Subject is all capitals
-3.3 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.1 MISSING_HEADERS        Missing To: header
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0000]

Re: pine folder internal data and sa-learn

Posted by Theodore Heise <th...@heise.nu>.
On Sat, 25 Sep 2004, Gregory Zornetzer wrote:
> On Fri, 24 Sep 2004, jdow wrote:
> > From: "Gregory Zornetzer" <ga...@nmrfam.wisc.edu>
> >
> > > Generally, I notice that sa-learn processes exactly one more message than
> > > I thought was in the folder.  When I take a look in the folder with a text
> > > editor, I see that there's a fake message that reads as follows:
> > > ---------
> > > >From MAILER-DAEMON Tue Dec  9 23:05:26 2003
> > > Date: Tue, 9 Dec 2003 23:05:26 -0600
> > > From: Mail System Internal Data <MA...@nmrfam.wisc.edu>
> >
> > Gregory, I have a cure for that. It's ugly and involved a few dozen lines
> > of C code.

> Ah - thanks for the tip. I'm going to take a guess and say that it looks
> pretty similar to the following perl code I just wrote.

I've been pointing sa-learn at Pine mail folders now for over two
years, and just ignoring the fact it's learning from the Pine folder
header.  I don't expect to actually get any e-mail resembling it.
During this time Bayes has always worked very effectively for me.

-- 
Theodore (Ted) Heise     <th...@heise.nu>     Bloomington, IN, USA

Re: pine folder internal data and sa-learn

Posted by Gregory Zornetzer <ga...@nmrfam.wisc.edu>.
Hi jdow,

On Fri, 24 Sep 2004, jdow wrote:

> From: "Gregory Zornetzer" <ga...@nmrfam.wisc.edu>
>
<cut for easy reading>
> >  I would like to train the Bayesian filter with these
> > messages, so using pine, I put them in a mail folder called spam, and I
> > run sa-learn on it as follows:
> > sa-learn --spam --mbox --showdots mail/spam
> >
> > Generally, I notice that sa-learn processes exactly one more message than
> > I thought was in the folder.  When I take a look in the folder with a text
> > editor, I see that there's a fake message that reads as follows:
> > ---------
> > >From MAILER-DAEMON Tue Dec  9 23:05:26 2003
> > Date: Tue, 9 Dec 2003 23:05:26 -0600
> > From: Mail System Internal Data <MA...@nmrfam.wisc.edu>
>
> Gregory, I have a cure for that. It's ugly and involved a few dozen lines
> of C code.
>
> I use the C code to find the second "^From " in the file. I save
> everything after that including the "From " to ./training/spam_train
> for training. I save everything before that to its original file. I
> arranged to do this with safe saves so data loss won't happen. Once
> I have cleaned out the spam mailbox I run sa-learn on the spam_train
> mailbox. Finally I append all the spam_train messages to "oldspam",
> delete spam_train, and touch spam_train so it's present for the next
> round.
>
> I use the same generic code for learning ham as well as spam. I just
> change the input parameters around a little. It's all part of a
> script "satrain" that I run as a cron job once a day.
Makes sense.

>
> For one or two people this is quite satisfactory. For large numbers
> of users an alternative approach might be called for.
Heh, luckily, it's just a single-user install.  Though I get the feeling
that others in my group might start pestering the sysadmin for system-wide
spam protection.

>
> I can send you the source for the "imapstrip" utility I built for
> doing this. (Imap and Ipop3 have the same header file these days.)
Ah - thanks for the tip. I'm going to take a guess and say that it looks
pretty similar to the following perl code I just wrote (please excuse my
lack of finesse with perl coding), except that this takes input on stdin
and writes to stdout.


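#!/usr/bin/perl
# Drop a leading Pine "FOLDER INTERNAL DATA" pseudo-message from an mbox
# read on stdin and write the remaining messages to stdout.
use strict;
use warnings;

my $line = <STDIN>;
if (defined $line && $line =~ /^From\sMAILER-DAEMON/) {
    # Skip ahead to the next mbox "From " separator, or to end of input.
    do {
        $line = <STDIN>;
    } until (!defined $line || $line =~ /^From\s/);
}
print $line if defined $line;
# Copy the rest of the mailbox through unchanged.
print while <STDIN>;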
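
The filter above keys on the "From MAILER-DAEMON" separator line alone, so a
genuine bounce sitting first in the folder would be dropped as well.  A
slightly more cautious variant, sketched here in sh and awk, holds the first
message back and discards it only if its Subject header contains "FOLDER
INTERNAL DATA".  The function name strip_pine_header is made up for
illustration, and the sketch assumes the folder is in ordinary mbox format
with each message starting at a "From " line.

```shell
#!/bin/sh
# strip_pine_header (hypothetical name): copy an mbox from stdin to stdout,
# dropping the first message only when it is the Pine pseudo-message.
strip_pine_header() {
    awk '
        # Each mbox message starts at a line beginning with "From ".
        /^From / { n++ }

        # Hold the first message back until it is clear whether it is
        # the pseudo-message. Only its headers are inspected.
        n <= 1 {
            if ($0 == "") inbody = 1
            if (!inbody && /^Subject: .*FOLDER INTERNAL DATA/) pseudo = 1
            held = held $0 "\n"
            next
        }

        # Second message reached: release the held text unless it was
        # the pseudo-message, then pass everything else through.
        !flushed { if (!pseudo) printf "%s", held; flushed = 1 }
        { print }

        # A folder with a single message never reaches the rule above.
        END { if (!flushed && !pseudo) printf "%s", held }
    '
}
```

With the function defined, "strip_pine_header < mail/spam > spam.clean"
followed by the same sa-learn command run on spam.clean would train on the
real messages only (spam.clean is just an example file name).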


Guess it's time for me to write some scripts.
Thanks,
-Greg


Re: pine folder internal data and sa-learn

Posted by jdow <jd...@earthlink.net>.
From: "Gregory Zornetzer" <ga...@nmrfam.wisc.edu>

> Hi all,
>
> I recently installed SpamAssassin 3.0.0 onto my unix account on an SGI IRIX 6.5
> box.  I'm using perl 5.8.5, and I generally read my email with pine,
> though sometimes I'll remotely view it using Evolution through the
> machine's IMAP server.
>
> The following is a portion of my .procmailrc file that is used for
> spamassassin filtering of my email:
>
> :0fw: spamassassin.lock
> * < 80000
> | spamassassin
>
> :0:
> * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> mail/spam-definitely
>
> :0:
> * ^X-Spam-Status: Yes
> mail/spam-probably
>
>
> I have noticed that the mail that gets into the spam-probably folder
> generally doesn't get autolearned by spamassassin.  Also, I've noticed
> one message that snuck through the spam filter (it only got a score of 3,
> and I haven't gotten enough spams trained in the Bayesian filter to
> activate it.)  I would like to train the Bayesian filter with these
> messages, so using pine, I put them in a mail folder called spam, and I
> run sa-learn on it as follows:
> sa-learn --spam --mbox --showdots mail/spam
>
> Generally, I notice that sa-learn processes exactly one more message than
> I thought was in the folder.  When I take a look in the folder with a text
> editor, I see that there's a fake message that reads as follows:
> ---------
> >From MAILER-DAEMON Tue Dec  9 23:05:26 2003
> Date: Tue, 9 Dec 2003 23:05:26 -0600
> From: Mail System Internal Data <MA...@nmrfam.wisc.edu>

Gregory, I have a cure for that. It's ugly and involved a few dozen lines
of C code.

I use the C code to find the second "^From " in the file. I save
everything after that including the "From " to ./training/spam_train
for training. I save everything before that to its original file. I
arranged to do this with safe saves so data loss won't happen. Once
I have cleaned out the spam mailbox I run sa-learn on the spam_train
mailbox. Finally I append all the spam_train messages to "oldspam",
delete spam_train, and touch spam_train so it's present for the next
round.

I use the same generic code for learning ham as well as spam. I just
change the input parameters around a little. It's all part of a
script "satrain" that I run as a cron job once a day.
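
The split step described above might look roughly like the following sh
sketch.  It is not the imapstrip utility (that is C and was not posted); the
function name split_mbox_for_training and its variable names are made up for
illustration.  It assumes mktemp and awk are available, that the pseudo-message
is the first message in the folder, and that no mail client or delivery agent
has the folder open while it runs, since it does no locking.

```shell
#!/bin/sh
# split_mbox_for_training MBOX TRAIN   (hypothetical helper)
# Leaves only the first message (assumed to be the Pine pseudo-message) in
# MBOX and appends every later message to TRAIN.
split_mbox_for_training() {
    smt_mbox=$1
    smt_train=$2

    # Build the new folder contents in a temp file next to the original so
    # the final mv is a rename on the same filesystem.
    smt_tmp=$(mktemp "${smt_mbox}.XXXXXX") || return 1

    # Lines before the second "From " line go to the temp file; lines from
    # the second "From " line onward are appended to TRAIN.
    if awk -v keep="$smt_tmp" -v train="$smt_train" '
            /^From / { n++ }
            n <= 1   { print > keep; next }
                     { print >> train }
        ' "$smt_mbox"
    then
        # Replace the original only after awk finished without error.
        mv "$smt_tmp" "$smt_mbox"
    else
        rm -f "$smt_tmp"
        return 1
    fi
}
```

A daily job could then run "split_mbox_for_training mail/spam
training/spam_train" followed by "sa-learn --spam --mbox training/spam_train",
and afterwards append spam_train to the archive folder and empty it.  Note that
mktemp creates the replacement folder with owner-only permissions, which may
differ from the permissions on the original file.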

For one or two people this is quite satisfactory. For large numbers
of users an alternative approach might be called for.

I can send you the source for the "imapstrip" utility I built for
doing this. (Imap and Ipop3 have the same header file these days.)

{^_^}