Posted to users@spamassassin.apache.org by "Dan Mahoney, System Admin" <da...@prime.gushi.org> on 2014/04/20 21:14:37 UTC

sa-learn from a cronjob?

All,

Most of my users aren't command-line friendly.  I'd like to basically have 
my IMAP server default to handing out two imap mailboxes that get 
auto-crontabbed to training bayes.

Ideally, I'd also like to make it so that things dropped in the learn_spam 
folder are deleted, and stuff in the learn_ham folder (mistake-based 
training) are de-tagged and moved back to the inbox.  Alternatively, a 
single "learned" folder would do.

Perl's Mail::Box seems like a heavy tool for this simple task.  Does 
anyone else have any recommendations?

-Dan

-- 


--------Dan Mahoney--------
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Site:  http://www.gushi.org
---------------------------


Re: sa-learn from a cronjob?

Posted by Daniel Staal <DS...@usa.net>.
--As of April 20, 2014 12:14:37 PM -0700, Dan Mahoney, System Admin is 
alleged to have said:

> Most of my users aren't command-line friendly.  I'd like to basically
> have my IMAP server default to handing out two imap mailboxes that get
> auto-crontabbed to training bayes.
>
> Ideally, I'd also like to make it so that things dropped in the
> learn_spam folder are deleted, and stuff in the learn_ham folder
> (mistake-based training) are de-tagged and moved back to the inbox.
> Alternatively, a single "learned" folder would do.
>
> Perl's Mail::Box seems like a heavy tool for this simple task.  Does
> anyone else have any recommendations?

--As for the rest, it is mine.

You might find this script helpful:
<https://github.com/DanStaal/Arcfind>

I wrote it ages ago for my own use to help in doing basically what you are 
asking for.  I found that my IMAP server had a bad habit of auto-deleting 
newly emptied directories, so I wanted to always leave at least one message 
in the 'learn as spam' folder.

I use it with Maildir folders: the invocation is usually along the lines of 
'mv `arcfind /mail/source/dir/cur/` /mail/dest/dir/cur/'

It doesn't feed to spamassassin itself, but a separate cronjob of 
'sa-learn' works just fine.

Daniel T. Staal

(I'm planning on putting it on CPAN as well, though I'm still considering 
the name and I need to fix some of the docs.  The main page README is 
correct, I just don't have the module versions documented fully yet.)

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: sa-learn from a cronjob?

Posted by Thomas Harold <th...@nybeta.com>.
On 4/20/2014 3:14 PM, Dan Mahoney, System Admin wrote:
> All,
> 
> Most of my users aren't command-line friendly.  I'd like to basically
> have my IMAP server default to handing out two imap mailboxes that get
> auto-crontabbed to training bayes.
> 

We do this, but you *really* need to trust your users to classify things
correctly.  So maybe you only advertise it to your "power" or
wise/discerning users.

In our IMAP setup (Dovecot/Pigeonhole) users all have a "Junk" folder in
the root of their mailbox.  Everybody's mailbox is a separate directory
in the MailDir format (one file per message).

Users that we trust are instructed to create "Junk/TrainAsSpam" and
"Junk/TrainAsHam" folders under "Junk/".  Then they put their
mis-trained messages into those folders.  The daily cron jobs then
inspect message files in those folders and run sa-learn on them.

...

The key bit of the script that we use is:

find "$UD/cur/" "$UD/new/" -type f -name '*' -mtime "-$DAYS" \
    -exec "${SALEARN}" --ham {} \;

$UD is the path to the user's directory, e.g.:

/var/vmail/example.com/username/Maildir/.Junk.TrainAsNotSpam

$DAYS is the age of the messages to look at, typically a value of 3 days
works fine if you run this daily.

$SALEARN is simply the path to the sa-learn command "/usr/bin/sa-learn"
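
A minimal sketch of how that find line might sit inside a daily cron script. It assumes the "Junk/TrainAsSpam" and "Junk/TrainAsHam" folder names described above, stored Maildir++ style as ".Junk.TrainAsSpam" and ".Junk.TrainAsHam". The function names and the defaults for SALEARN and DAYS are illustrative, not taken from the actual script.

```shell
#!/bin/sh
# Sketch only: train Bayes from one user's training folders.
# SALEARN and DAYS mirror the variables described above; the defaults
# and the function names are assumptions for illustration.
SALEARN=${SALEARN:-/usr/bin/sa-learn}
DAYS=${DAYS:-3}

# train_folder TYPE DIR: run sa-learn --TYPE on recent messages in DIR.
train_folder() {
    # Skip quietly if the user never created the folder.
    [ -d "$2/cur" ] && [ -d "$2/new" ] || return 0
    find "$2/cur" "$2/new" -type f -mtime "-$DAYS" \
        -exec "$SALEARN" "--$1" {} \;
}

# train_user MAILDIR: handle both training folders of one mailbox.
train_user() {
    train_folder spam "$1/.Junk.TrainAsSpam"
    train_folder ham  "$1/.Junk.TrainAsHam"
}

# Run only when a Maildir path is given on the command line.
if [ "$#" -gt 0 ]; then
    train_user "$1"
fi
```

A cron wrapper could call this once per mailbox, for example with /var/vmail/example.com/username/Maildir as the argument.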

...

Naturally, I make no warranty for fitness of purpose of the attached
scripts.  Nor is this the only way to skin the cat.

Re: sa-learn from a cronjob?

Posted by Benny Pedersen <me...@junc.eu>.
Dan Mahoney, System Admin skrev den 2014-04-20 21:14:

> Perl's Mail::Box seems like a heavy tool for this simple task.  Does
> anyone else have any recommendations?

If you are happy with Dovecot, you can use the dovecot-antispam plugin to 
handle this as live learning for all users who are afraid of console tools 
:=)

The one minor drawback is that it is done live, so be prepared to relearn 
+1 msg live.

Other than that it works nicely, even with Dovecot 1.x; remember to 
update to Dovecot 2.x as soon as possible.

Re: sa-learn from a cronjob?

Posted by RW <rw...@googlemail.com>.
On Wed, 30 Apr 2014 13:52:52 -0600
Bob Proulx wrote:


> 
> The maildir exists and a cron script can be used to scan and process
> mail incoming there.  People do it.  It works.  Saying it does not
> work or is not sensible is just wrong mean talk.  People do this all
> of the time.

So do I. I haven't said it's wrong to train from Maildir, the issue is
that Ian's script trains from the new/ directory and then moves it to
cur/. 

When you train from a Maildir that's accessed from a client you need to
train from the cur/ directory (or both new/ and cur/), and either leave
it in that Maildir or move it to another. 

>   Ian does it.  I do it.


I don't know how many times I have to repeat this before you
understand, but Ian doesn't use the conventional training folders
that the OP was asking about. His scheme is reliable because his
folders aren't opened in IMAP or an MUA. He acknowledged this in his
own reply to my original comment.

When I looked at Ian's script it was obvious that it either didn't
work for Ian or he was holding something back that the OP needed to
know. Either way a clarification seemed important. 

You accuse me of being negative, but all I've done is present technical
issues. You've made little effort to understand them or address my
points on a technical level, you've spent a lot of time misrepresenting
my motives and accusing me of being confused. Which of us is being
negative?

  







Re: sa-learn from a cronjob?

Posted by Bob Proulx <bo...@proulx.com>.
RW wrote:
> Bob Proulx wrote:
> > The script is looping through mail files in a maildir and processing
> > them remotely on the server through sa-learn.  After processing the
> > messages it is moving the messages to mark them as having been read.
> 
> No, the Maildir spec defines the "S" flag in the info field for marking
> mail as read (seen), the new/ to cur/ move  is done by an IMAP server
> (or a local Unix client) in the first session that sees the new mail. 
> 
> Copying an email into an IMAP folder via IMAP will not put it into the
> new/ sub-directory of the underlying maildir. Opening a folder in IMAP
> will empty the new/ sub-directory.
> 
> If you don't believe this, I suggest you actually try it on a real
> IMAP server.   I just tried it on Dovecot, and I found it behaves as I
> expected. Newly delivered mail is moved to cur/ when a client is first
> informed about it, copied mail goes to cur/ in the destination mailbox. 

Hmm...  Works for me.  Apparently it works for Ian.  YMMV.

Personally my process removes mail from incoming spam-new folder and
then saves it into the processed spam folder.  That is the way I
prefer to run it.  I use two folders rather than one.  Again YMMV.
Works for me.  Sorry if it does not work for you.

> > > You might have mentioned that because it means it's not the
> > > solution you implied when you wrote "Here is my cronjob for that
> > > purpose". It's certainly not appropriate to users that don't like
> > > the command line.
> > 
> > Sorry but you are incorrect.  Users of Ian's system need not use the
> > command line.  His solution directly answered Dan's question.
> 
> No, he said himself that my objections don't apply because it's an
> isolated mailbox that's not read by anything except the cron script. A
> macro in the client places the mail directly into the mailbox (bypassing
> the client's conventional mailbox handling) - this is really only even
> remotely sensible for a local instance of mutt, emacs etc.

I think you are completely misunderstanding how this type of process
works.  And I can't avoid saying that this seems intentional by the
tone.  Sorry.  But that is the way it reads to me.  Have tried to help
in good faith but if that good faith is not reciprocated then I am
going to lose interest very quickly.

But let me try again very briefly one last time anyway since I am an
incorrigible optimist.  Two things are very common.  IMAP servers.
Use of maildir.  One does not require the other.  But they very often
appear together.  It is not required to use mutt or emacs or other of
the traditional email clients for this even if that is a typical
desired developer environment.  All that is required for this type of
scripted method is that the backend use maildirs for mail storage.
That way the files can be scanned and processed offline.  I dare say
that most of the masses use web email clients these days.  Or if not
most then a very large number.  They will never see the maildir.

Since use of maildirs is typical for an IMAP server it means that any
of the plethora of imap clients, including web email interfaces to
imap, can be used to interact with the imap server and through that
the maildir folders on the backend.  A user running an imap client
might never see the maildir.  A user running a web mail client would
certainly never see a maildir.  That does not mean that the maildir
does not exist.  That does not mean that the maildir cannot be scanned
and processed offline for background training of the Bayes database.

The maildir exists and a cron script can be used to scan and process
mail incoming there.  People do it.  It works.  Saying it does not
work or is not sensible is just wrong mean talk.  People do this all
of the time.  Ian does it.  I do it.  Meanwhile no one is disputing
that there are better ways to do things.  There are always better
ways.  Which is why it is so much appreciated when people share.  Then
we can all learn and move forward.  But what can be said when someone
says that something people are doing and making good use of is not
sensible?  I think I will choose to say nothing more.

> Mostly, it's pretty trivial to train Bayes from Maildir, but there
> is one significant complication, and that's that moving mail between
> Maildirs after training may break IMAP keywords, which some clients
> use for custom flags or for sharing proprietary metadata between
> separate client instances. 

Yes it is pretty trivial.  Which has been the topic of this thread.
Simple scripts to scan and process maildirs.  Here you point out some
likely valid issues of breaking tags.  However maintaining tags for
spam messages moved into the training folder isn't a problem that I
find compelling.  Certainly not compelling enough to not do it.

I look forward to reading your positive contribution to the anti-spam
effort.

Bob

-- 
  http://xkcd.com/386/

Re: sa-learn from a cronjob?

Posted by RW <rw...@googlemail.com>.
On Thu, 24 Apr 2014 14:37:52 -0600
Bob Proulx wrote:

> RW wrote:
> > Ian Zimmerman wrote:
> > > RW wrote:
> > > RW> I don't think it will work for the purpose mentioned, and if
> > > RW> it's working properly for you, there's a lot you're not
> > > RW> mentioning.
> 
> I looked at the script and it looks like an example that would work
> for Ian fine.
> ...
> 
> > > RW> It's only looking for mail in the immediate post-delivery
> > > RW> state after it's been put into the mailbox by an MTA or MDA
> > > RW> and before it's been detected as new mail by an MUA (directly
> > > RW> or via IMAP). It won't learn mail put into the folders by an
> > > RW> MUA or IMAP at all.
> 
> No.  That isn't what the script is doing.
> 
> The script is looping through mail files in a maildir and processing
> them remotely on the server through sa-learn.  After processing the
> messages it is moving the messages to mark them as having been read.

No, the Maildir spec defines the "S" flag in the info field for marking
mail as read (seen), the new/ to cur/ move  is done by an IMAP server
(or a local Unix client) in the first session that sees the new mail. 

Copying an email into an IMAP folder via IMAP will not put it into the
new/ sub-directory of the underlying maildir. Opening a folder in IMAP
will empty the new/ sub-directory.

If you don't believe this, I suggest you actually try it on a real
IMAP server.   I just tried it on Dovecot, and I found it behaves as I
expected. Newly delivered mail is moved to cur/ when a client is first
informed about it, copied mail goes to cur/ in the destination mailbox. 
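
To make the distinction concrete: in the Maildir format the "seen" state lives in the file name, not in which sub-directory holds the file. Names in cur/ end in ":2," followed by flag letters, and "S" marks a message as seen. A small sketch that checks for that flag (the function name maildir_is_seen is made up for illustration):

```shell
#!/bin/sh
# maildir_is_seen FILE: succeed if the Maildir file name carries the
# "S" (seen) flag in its ":2,<flags>" info suffix.
maildir_is_seen() {
    name=${1##*/}                       # strip any directory part
    case $name in
        *:2,*) flags=${name##*:2,} ;;   # letters after the last ":2,"
        *)     return 1 ;;              # no info field, e.g. still in new/
    esac
    case $flags in
        *S*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

For example, `maildir_is_seen "cur/1398368241.M1P2.mail:2,RS" && echo seen` prints "seen", while a name ending in a bare ":2," does not.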

> > You might have mentioned that because it means it's not the
> > solution you implied when you wrote "Here is my cronjob for that
> > purpose". It's certainly not appropriate to users that don't like
> > the command line.
> 
> Sorry but you are incorrect.  Users of Ian's system need not use the
> command line.  His solution directly answered Dan's question.

No, he said himself that my objections don't apply because it's an
isolated mailbox that's not read by anything except the cron script. A
macro in the client places the mail directly into the mailbox (bypassing
the client's conventional mailbox handling) - this is really only even
remotely sensible for a local instance of mutt, emacs etc.


Mostly, it's pretty trivial to train Bayes from Maildir, but there
is one significant complication, and that's that moving mail between
Maildirs after training may break IMAP keywords, which some clients
use for custom flags or for sharing proprietary metadata between
separate client instances. 







Re: sa-learn from a cronjob?

Posted by Bob Proulx <bo...@proulx.com>.
RW wrote:
> Ian Zimmerman wrote:
> > RW wrote:
> > RW> I don't think it will work for the purpose mentioned, and if it's
> > RW> working properly for you, there's a lot you're not mentioning.

I looked at the script and it looks like an example that would work
for Ian fine.  There are some points of shell programming style that I
would like to avoid seeing propagated in an example though. :-)  But I
think that it is great that Ian shared his script just the same.  This
is one of those things where if ten of us showed all of our working
examples that we would have 12 different scripts.

The biggest thing that hurts Ian's script as a general example is that
it is using ssh to connect to the server running spamassassin.  Most
developers use ssh every day and so that is very normal.  But most of
the masses of email users will not be in a position to use ssh
effectively.  A mail administrator would be able to see the example for
what it is and then write that part differently though.

> > RW> It's only looking for mail in the immediate post-delivery state
> > RW> after it's been put into the mailbox by an MTA or MDA and before
> > RW> it's been detected as new mail by an MUA (directly or via IMAP).
> > RW> It won't learn mail put into the folders by an MUA or IMAP at all.

No.  That isn't what the script is doing.

The script is looping through mail files in a maildir and processing
them remotely on the server through sa-learn.  After processing the
messages it is moving the messages to mark them as having been read.

The script is obviously meant to be run periodically by cron.  At that
time it will walk through every message that has been stored into the
ham and spam mailboxes.  A user would only need to store the message
into the appropriate mailbox.  A spam message into the spam mailbox
and then later in the background the cron task will send the spam
message through sa-learn --spam for learning.  Same for --ham.  The
script is fairly obvious, straightforward, and brute force.

> > RW> You need to use separate destination mailboxes.
> > 
> > These are _not_ general purpose Maildirs.  The normal mail processing
> > pipe (MTA -> LDA -> IMAP -> MUA) knows nothing about them.  To mark
> > something as spam/ham, a user (me) executes a custom macro in the MUA
> > which pipes the message through the safecat command to "deliver" it
> > explicitly to one of these directories. 
> 
> You might have mentioned that because it means it's not the solution you
> implied when you wrote "Here is my cronjob for that purpose". It's
> certainly not appropriate to users that don't like the command line.

Sorry but you are incorrect.  Users of Ian's system need not use the
command line.  His solution directly answered Dan's question.

Dan Mahoney wrote:
> I'd like to basically have my IMAP server default to handing out two
> imap mailboxes that get auto-crontabbed to training bayes.

Ian Zimmerman wrote:
> Here is my cronjob for that purpose, in its entirety.  Note that
> each of ~/spam-corpora{ham,spam} is a Maildir.  There is a small
> race condition between the sa-learn run and the move to cur, which
> wasn't worth fixing in my case; if you use this and fix it let me
> know :)

Which is exactly what his script does.  (I don't like the
implementation as written because the shell scripting has some rough
spots.  But...)

> > Basically, Maildir is just a convenient container format here.  It
> > could be a database or whatever.
> > 
> > Does that answer your objections?
> 
> A Maildir isn't any more convenient than two simple directories. It
> doesn't really matter if you are the only user, but in general putting
> a Maildir that mustn't be opened in home directories wouldn't be a
> very good idea.

I am having a hard time understanding what you are objecting to here.
Dan was the one with the question.  Ian shared something that would do
the task.  It looks like you are having a hard time understanding how
this worked.  If so then please ask questions so as to understand it.
It doesn't make sense to gripe about it without reason.  Sharing and
commenting and peer review and iterating a solution and improving it
is how community efforts work and succeed and grow.

Your comment that a maildir isn't better than two simple directories
implies that you are not familiar with the maildir mailbox format.
Maildir is an ad-hoc standard mailbox format used by most imap
servers.  Using maildir mailboxes would definitely be better than
using two simple directories.  Standard is better than better!

There isn't any reason that it "mustn't be opened".  In fact the
opposite.  The user must be able to open the mailbox and must be able
to save misclassified messages there for learning.  If they do that by
mistake then they can pull the message back out before the crontask
runs.  (That timing is one of my issues with the script that I would
want to see improved.)

Using a maildir for these two purposes makes a lot of sense.  The user
reading email using any of the popular ways to read email these days
then can simply save the message into the appropriate mailbox.  That
could be an imap client or a web mail browser.  If they get a spam
message they can simply save the message into the spam mailbox.  Then
Ian's process is to use a cron task to periodically send all email
that has been saved into the spam mailbox through to sa-learn --spam
on the server training the SpamAssassin Bayes engine on the message.
And the opposite for non-spam for misclassified messages.  For the end
mail reading user no command line knowledge is needed.  They simply
need to be able to save email into mail folders.  Simple for them.
All of the effort is in the backend on the mail server.  Would work
for a large number of users.

Bob

Re: sa-learn from a cronjob?

Posted by RW <rw...@googlemail.com>.
On Thu, 24 Apr 2014 09:29:21 -0700
Ian Zimmerman wrote:

> On Thu, 24 Apr 2014 15:07:32 +0100
> RW <rw...@googlemail.com> wrote:
> 
> RW> I don't think it will work for the purpose mentioned, and if it's
> RW> working properly for you, there's a lot you're not mentioning.
> 
> RW> It's only looking for mail in the immediate post-delivery state
> RW> after it's been put into the mailbox by an MTA or MDA and before
> RW> it's been detected as new mail by an MUA (directly or via IMAP).
> RW> It won't learn mail put into the folders by an MUA or IMAP at all.
> 
> RW> You need to use separate destination mailboxes.
> 
> These are _not_ general purpose Maildirs.  The normal mail processing
> pipe (MTA -> LDA -> IMAP -> MUA) knows nothing about them.  To mark
> something as spam/ham, a user (me) executes a custom macro in the MUA
> which pipes the message through the safecat command to "deliver" it
> explicitly to one of these directories. 

You might have mentioned that because it means it's not the solution you
implied when you wrote "Here is my cronjob for that purpose". It's
certainly not appropriate to users that don't like the command line.


>  Basically, Maildir is just a
> convenient container format here.  It could be a database or whatever.
> 
> Does that answer your objections?

A Maildir isn't any more convenient than two simple directories. It
doesn't really matter if you are the only user, but in general putting
a Maildir that mustn't be opened in home directories wouldn't be a
very good idea.

Re: sa-learn from a cronjob?

Posted by Ian Zimmerman <it...@buug.org>.
On Thu, 24 Apr 2014 15:07:32 +0100
RW <rw...@googlemail.com> wrote:

RW> I don't think it will work for the purpose mentioned, and if it's
RW> working properly for you, there's a lot you're not mentioning.

RW> It's only looking for mail in the immediate post-delivery state
RW> after it's been put into the mailbox by an MTA or MDA and before
RW> it's been detected as new mail by an MUA (directly or via IMAP). It
RW> won't learn mail put into the folders by an MUA or IMAP at all.

RW> You need to use separate destination mailboxes.

These are _not_ general purpose Maildirs.  The normal mail processing
pipe (MTA -> LDA -> IMAP -> MUA) knows nothing about them.  To mark
something as spam/ham, a user (me) executes a custom macro in the MUA
which pipes the message through the safecat command to "deliver" it
explicitly to one of these directories.  Basically, Maildir is just a
convenient container format here.  It could be a database or whatever.

Does that answer your objections?

-- 
Please *no* private copies of mailing list or newsgroup messages.

gpg public key: 2048R/984A8AE4
fingerprint: 7953 ADA1 0E8E AB57 FB79  FFD2 360A 88B2 984A 8AE4
Funny pic: http://bit.ly/ZNE2MX

Re: sa-learn from a cronjob?

Posted by RW <rw...@googlemail.com>.
On Wed, 23 Apr 2014 19:15:13 -0700
Ian Zimmerman wrote:

> On Sun, 20 Apr 2014 12:14:37 -0700 (PDT)
> "Dan Mahoney, System Admin" <da...@prime.gushi.org> wrote:
> 
> > Most of my users aren't command-line friendly.  I'd like to
> > basically have my IMAP server default to handing out two imap
> > mailboxes that get auto-crontabbed to training bayes.
> 
> Here is my cronjob for that purpose, in its entirety.  

I don't think it will work for the purpose mentioned, and if it's
working properly for you, there's a lot you're not mentioning.

It's only looking for mail in the immediate post-delivery state after
it's been put into the mailbox by an MTA or MDA and before it's
been detected as new mail by an MUA (directly or via IMAP). It won't
learn mail put into the folders by an MUA or IMAP at all.

You need to use separate destination mailboxes.


Re: sa-learn from a cronjob?

Posted by Bob Proulx <bo...@proulx.com>.
Ian Zimmerman wrote:
> Here is my cronjob for that purpose, in its entirety.  Note that each of
> ~/spam-corpora{ham,spam} is a Maildir.  There is a small race condition
> between the sa-learn run and the move to cur, which wasn't worth fixing
> in my case; if you use this and fix it let me know :)

I looked over your script.  I think the use of the ssh for remote
processing will probably make it less available to most people.  You
might consider setting up spamd and spamc for this purpose instead.

Also, to give people a known time to react to mistakes it is nice to
not process email immediately but to specify some time such as five
minutes after saving it or some such.  I use find with a ! -newerct "5
minutes ago" to process messages older than five minutes.  That way if
I save something by mistake I have a few minutes to react and remove
the message from the learning.
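
The grace-period idea can be sketched on its own. This assumes GNU find, whose -newerct test compares a file's inode change time against a date string; the function name settled_messages and the optional age argument are illustrative.

```shell
#!/bin/sh
# settled_messages DIR [AGE]: list message files in DIR/new and DIR/cur
# whose inode change time is not newer than AGE (default: 5 minutes ago).
# Requires GNU find for -newerct with a date string.
settled_messages() {
    find "$1/new" "$1/cur" -type f ! -newerct "${2:-5 minutes ago}" -print
}
```

A call such as `settled_messages "$HOME/Maildir/.learn_spam"` (a hypothetical folder path) would then skip anything saved within the last five minutes. Testing the change time rather than the modification time is presumably deliberate, since some servers set a message file's mtime to the message's original date.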

Instead of mv I have used safecat for moving messages around.  And
generally I avoid worrying about whitespace in filenames for this
since I am guaranteed the file names are well formed without any
whitespace.

Instead of:

        for m in `ls ~/spam-corpora/${food}/new` ; do
            cat ~/spam-corpora/${food}/new/${m} | formail
        done | ssh $server sa-learn --${food} --mbox -

I would suggest something more along the lines of this different and
not equivalent but similar script.

  cd $MAILBOXDIR || exit 1
  for f in $(find spam-new/new spam-new/cur -ignore_readdir_race -type f ! -newerct "6 minutes ago" -print); do

    spamc -x -d $server --learntype=spam < "$f"
    rc=$?
    if [ $rc -eq 0 ] || [ $rc -eq 98 ]; then
      # rc=98: This appears to be the return (undocumented) when spamc
      # can't learn the message because it is already learned.  The
      # docs say that EX_TOOBIG 98 is not otherwise used.
      if safecat spam/tmp spam/cur < "$f" >/dev/null; then
        rm -f "$f"
      fi
    else
      echo "sa-learn failed $rc on $f"
    fi

  done

Perhaps the comments about spamc return code 98 would cause someone
here to look at that part of the code.  It has been years since I put
in that comment.  Perhaps it is even different now.  Don't know.

I have thought about refactoring this into two scripts so that the
find could -exec the second.  That would eliminate the for f in
arguments syntax which would save memory.  But the memory use is small
for my case, I do not need to worry about filenames with whitespace,
and I like having one script instead of two so that I can see everything.
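
The one-script-versus-two trade-off can also be sidestepped by letting find hand the file names to an inline `sh -c` body. The following is a hedged sketch, not the script described above: it calls `sa-learn` directly instead of `spamc`, uses a plain `mv` into cur/ in place of `safecat`, and the function name learn_and_file is made up for illustration.

```shell
#!/bin/sh
# learn_and_file TYPE SRC DST: feed each message under SRC/new and SRC/cur
# to "sa-learn --TYPE", then move it into DST/cur if learning succeeded.
# File names are passed as arguments, so they are never word-split.
learn_and_file() {
    find "$2/new" "$2/cur" -type f -exec sh -c '
        kind=$1 dst=$2
        shift 2
        for f in "$@"; do
            name=${f##*/}
            # Names in cur/ carry an info suffix; add an empty one if missing.
            case $name in
                *:2,*) ;;
                *) name="$name:2," ;;
            esac
            if sa-learn "--$kind" "$f" >/dev/null; then
                mv "$f" "$dst/cur/$name"
            else
                echo "sa-learn failed on $f" >&2
            fi
        done
    ' sh "$1" "$3" {} +
}
```

Something like `learn_and_file spam spam-new spam` would mirror the two-folder layout above. Because `{} +` batches many file names into each `sh` invocation, nothing is expanded through an unquoted `$(find ...)`.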

Something to think about.  The above is not in its entirety because I
cut it down from a larger case that is doing other things.  It would
need a little work.  But it might give some ideas.

Bob

Re: sa-learn from a cronjob?

Posted by Ian Zimmerman <it...@buug.org>.
On Sun, 20 Apr 2014 12:14:37 -0700 (PDT)
"Dan Mahoney, System Admin" <da...@prime.gushi.org> wrote:

> Most of my users aren't command-line friendly.  I'd like to basically
> have my IMAP server default to handing out two imap mailboxes that
> get auto-crontabbed to training bayes.

Here is my cronjob for that purpose, in its entirety.  Note that each of
~/spam-corpora{ham,spam} is a Maildir.  There is a small race condition
between the sa-learn run and the move to cur, which wasn't worth fixing
in my case; if you use this and fix it let me know :)

-- 
Please *no* private copies of mailing list or newsgroup messages.

gpg public key: 2048R/984A8AE4
fingerprint: 7953 ADA1 0E8E AB57 FB79  FFD2 360A 88B2 984A 8AE4
Funny pic: http://bit.ly/ZNE2MX

Re: sa-learn from a cronjob?

Posted by James Michael Keller <jm...@houseofzen.org>.
On 04/20/2014 03:14 PM, Dan Mahoney, System Admin wrote:
> All,
>
> Most of my users aren't command-line friendly.  I'd like to basically 
> have my IMAP server default to handing out two imap mailboxes that get 
> auto-crontabbed to training bayes.
>
> Ideally, I'd also like to make it so that things dropped in the 
> learn_spam folder are deleted, and stuff in the learn_ham folder 
> (mistake-based training) are de-tagged and moved back to the inbox.  
> Alternatively, a single "learned" folder would do.
>
> Perl's Mail::Box seems like a heavy tool for this simple task. Does 
> anyone else have any recommendations?
>
> -Dan
>

I have three Maildir folders set up which IMAP points to for the users, 
and generate those with /etc/skel for new users. SPAM/Spam-Missed is for 
user selected spam that wasn't sent to SPAM/Spam-Mail folder, 
SPAM/Spam-Mail is for SA marked spam and is moved with a default 
procmail file also from /etc/skel. SPAM/Spam-Ham is user selected 
non-SPAM messages that made it to SPAM/Spam-Mail.

Then I have a folder config file set up for sa-learn to load:

root@omega:/etc/spamassassin# cat sa-learn-folders.conf
spam:dir:/home/*/Maildir/.SPAM.Spam-Missed/{cur,new}
spam:dir:/home/*/Maildir/.SPAM.Spam-Mail/{cur,new}
ham:dir:/home/*/Maildir/.SPAM.Spam-Ham/{cur,new}
root@omega:/etc/spamassassin#

Then I have a cron call to sa-learn

root@omega:/etc/cron.d# more sa-learn
# Cron entry for sa-learn
MAILTO=root@localhost
0 *    * * *     root /usr/bin/sa-learn --username=Debian-exim --no-sync --dbpath=/var/spool/exim4/.spamassassin/bayes --folders=/etc/spamassassin/sa-learn-folders.conf >> /var/log/sa-learn-run.log
root@omega:/etc/cron.d#

--
-James

Re: sa-learn from a cronjob?

Posted by John Hardin <jh...@impsec.org>.
On Sun, 20 Apr 2014, Dan Mahoney, System Admin wrote:

> All,
>
> Most of my users aren't command-line friendly.  I'd like to basically have my 
> IMAP server default to handing out two imap mailboxes that get 
> auto-crontabbed to training bayes.
>
> Ideally, I'd also like to make it so that things dropped in the learn_spam 
> folder are deleted, and stuff in the learn_ham folder (mistake-based 
> training) are de-tagged and moved back to the inbox.  Alternatively, a single 
> "learned" folder would do.
>
> Perl's Mail::Box seems like a heavy tool for this simple task.  Does anyone 
> else have any recommendations?
>
> -Dan

Warning/recommendation: do NOT learn from user-supplied FP/FN reports 
without manual review, unless you really trust the individual user's 
judgement and responsibility.

Far too often users will do things like drop stuff they actually *did* 
intentionally subscribe to (or that is from a vendor they *have* bought 
things from) into the spam-training folder rather than unsubscribing 
properly when they lose interest.

For the learn-ham folder, train your users to *copy* the message from 
their spam quarantine folder to their learn-ham folder and then *move* it 
back to their inbox. That way you don't need to re-deliver it to their 
inbox via scripting, they've already done that if they want to keep a copy 
of the message.

If you still want to script that instead, take a look at "formail". It's 
part of the procmail suite if you have that installed.
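
For the scripted route, one possible shape is sketched below. It is an assumption-laden example rather than a recommendation from this thread: it uses `spamassassin -d` (remove markup) instead of formail to de-tag, delivers by the usual Maildir convention of writing into tmp/ and renaming into new/, and the function name detag_to_inbox and the file-name scheme are made up for illustration.

```shell
#!/bin/sh
# detag_to_inbox MSGFILE INBOX: strip SpamAssassin markup from MSGFILE,
# deliver the result into the Maildir INBOX, then delete the original.
detag_to_inbox() {
    # Unique name: epoch seconds, PID plus a counter, and the host name.
    detag_seq=$(( ${detag_seq:-0} + 1 ))
    name="$(date +%s).P$$Q$detag_seq.$(uname -n)"
    if spamassassin -d < "$1" > "$2/tmp/$name"; then
        # Rename into new/ only once the file is complete.
        mv "$2/tmp/$name" "$2/new/$name" && rm -f "$1"
    else
        rm -f "$2/tmp/$name"
        echo "could not de-tag $1" >&2
        return 1
    fi
}
```

This would run on each file in the learn-ham folder after the `sa-learn --ham` pass, with the user's own Maildir as the second argument.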

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Warning Labels we'd like to see #1: "If you are a stupid idiot while
  using this product you may hurt yourself. And it won't be our fault."
-----------------------------------------------------------------------
  3 days until Max Planck's 156th birthday

RE: sa-learn from a cronjob?

Posted by David Jones <dj...@ena.com>.
fetchmail works well.  Use procmail if needed.
________________________________________
From: Dan Mahoney, System Admin <da...@prime.gushi.org>
Sent: Sunday, April 20, 2014 2:14 PM
To: users@spamassassin.apache.org
Subject: sa-learn from a cronjob?

All,

Most of my users aren't command-line friendly.  I'd like to basically have
my IMAP server default to handing out two imap mailboxes that get
auto-crontabbed to training bayes.

Ideally, I'd also like to make it so that things dropped in the learn_spam
folder are deleted, and stuff in the learn_ham folder (mistake-based
training) are de-tagged and moved back to the inbox.  Alternatively, a
single "learned" folder would do.

Perl's Mail::Box seems like a heavy tool for this simple task.  Does
anyone else have any recommendations?

-Dan
