Posted to users@spamassassin.apache.org by Sergio Durigan Junior <se...@sergiodj.net> on 2013/11/08 19:09:38 UTC

spamc -L apparently not working properly

Hey there,

I am using Debian Wheezy here (therefore, Exim + Dovecot for e-mail),
and I am still deciding how to run SpamAssassin.  I am divided between
running it by directly calling spamassassin, or by running spamd and
calling spamc.  Both methods are going to be used via my .procmailrc.

Well, but so far I have been testing spamd + spamc because it is the
Debian recommended way.  I still haven't enabled it via .procmailrc, and
just did tests by calling spamc via CLI.  However, I am seeing a strange
behavior when I try to feed spamd with a false-negative message.  Here's
what I am doing:

  #> spamc -c < spam.file
  0.0/5.0
  #> spamc -L spam < spam.file
  (successful message saying that the spam was learned)
  #> spamc -c < spam.file
  0.0/5.0

I have already updated my Bayesian database, restarted the spamd
service, etc.  I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening.  Any suggestions?

I am running spamd with the following options:

  --create-prefs --max-children 5 --helper-home-dir --allow-tell

And the version I am using is:

  SpamAssassin version 3.3.2
    running on Perl version 5.14.2

Comments and suggestions are appreciated.  Thanks!

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sat, 2013-11-09 at 01:35 -0200, Sergio Durigan Junior wrote:
> On Friday, November 08 2013, Amir Caspi wrote:

> > I would run spamd as root and initiate spamc with the -u option, to allow
> > each user to have his/her own Bayes DB.  However, again, it really depends
> > on what kind of email system you're running, and how you want to handle
> > spam.  If you're running a corporate server, you might prefer a global DB;
> > if you're running a server with personal users whose email characteristics
> > vary widely, you might prefer per-user DBs.  For my setup, I prefer
> > per-user DBs.

You mentioned using SA from procmail, so there usually is no need for
the -u user option (see that other sub-thread about this option).

Running the spamd daemon as root and calling spamc as the receiving user
is an easy way to get per-user Bayes databases. Keep in mind though,
this requires Bayes training per user, and every user needs their own
$HOME or related options.


> Thanks for the opinion.  I was considering doing that, and your message
> was the final word I needed.
> 
> Now everything is set up per-user, and I am feeding the Bayes DB with
> what I have.

What I wrote above was partially triggered by this. Not "the Bayes DB",
which sounds like a single one to me, but one Bayes db per user. Which
requires initial training per user.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, Amir Caspi wrote:

>> What's your opinion?
>
> I would run spamd as root and initiate spamc with the -u option, to allow
> each user to have his/her own Bayes DB.  However, again, it really depends
> on what kind of email system you're running, and how you want to handle
> spam.  If you're running a corporate server, you might prefer a global DB;
> if you're running a server with personal users whose email characteristics
> vary widely, you might prefer per-user DBs.  For my setup, I prefer
> per-user DBs.

Thanks for the opinion.  I was considering doing that, and your message
was the final word I needed.

Now everything is set up per-user, and I am feeding the Bayes DB with
what I have.

Thanks,

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Fri, November 8, 2013 2:56 pm, Sergio Durigan Junior wrote:
> The problem with having a user-tailored database is that I will have to
> run sa-update for every user, right?

No, or at least, not that I've seen.  If spamd is running as root, it will
load the sa-update rules from the root installation
(/var/lib/spamassassin); it will only su to the user when called by spamc,
and then it will only load that user's local Bayes DB and local rules (if
enabled); it doesn't have to load any of the main rules, which are kept in
memory from when spamd was first initiated (and were loaded from the root
installation).  This is also why it's important to restart spamd when
sa-update actually updates rules (the sa-update cron script should do this
for you).
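
A minimal sketch of that update-then-restart step, for anyone wiring it up by
hand. It assumes sa-update exits with status 0 only when new rules were
installed and 1 when nothing changed. SA_UPDATE_CMD and SPAMD_RESTART_CMD are
hypothetical variable names, and the default restart command is an assumption
about a Debian-like init setup, so adjust both to your system.

```shell
#!/bin/sh
# Run the rule updater and restart spamd only when new rules were installed.
# SA_UPDATE_CMD and SPAMD_RESTART_CMD are hypothetical names for this sketch;
# the defaults are assumptions about a Debian-like system.
SA_UPDATE_CMD=${SA_UPDATE_CMD:-sa-update}
SPAMD_RESTART_CMD=${SPAMD_RESTART_CMD:-service spamassassin restart}

update_and_restart() {
    # Assumed exit codes: 0 = update installed, 1 = no fresh update,
    # anything higher = error.
    status=0
    $SA_UPDATE_CMD || status=$?
    case $status in
        0) $SPAMD_RESTART_CMD ;;
        1) echo "no new rules; spamd left running" ;;
        *) echo "sa-update failed with status $status" >&2; return "$status" ;;
    esac
}

# Usage (as root, e.g. from cron):
#   update_and_restart
```

spamd keeps the main rules in memory, which is why the restart only matters
when the updater actually installed something.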

At least, this is how it works on my system, which has a pretty vanilla
install of SA.

Even if your users are running spamassassin rather than spamc, it should be
able to read the rules in the root install location, as long as your users
have read permission.  If you're running on a virtual host platform with
multiple chroot environments (e.g. cPanel, Parallels Pro Control Panel,
etc.) then you may need to run sa-update for each environment, but you
should still only need the one root install (and one sa-update command)
for running spamd as root.

> What's your opinion?

I would run spamd as root and initiate spamc with the -u option, to allow
each user to have his/her own Bayes DB.  However, again, it really depends
on what kind of email system you're running, and how you want to handle
spam.  If you're running a corporate server, you might prefer a global DB;
if you're running a server with personal users whose email characteristics
vary widely, you might prefer per-user DBs.  For my setup, I prefer
per-user DBs.

						--- Amir


Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, Amir Caspi wrote:

> On Fri, November 8, 2013 2:39 pm, Sergio Durigan Junior wrote:
>> I don't think sa-learn can help with spamd.  Its own manpage mentions
>> that, for spamd users, "spamc -L" is the way to go.
>>
>> Hm, really?  I thought spamd kept a global Bayes database, and that
>> everyone calling "spamc -L" would end up feeding this database, and not
>> some local one.
>
> It depends on how spamc is called.  If spamd is running as root and spamc
> is called with the -u flag, then spamd will su to the named user, and will
> then use that user's local database (and local prefs, if allow_user_prefs
> is enabled).  spamc -L -u would work on the local database; spamc -L
> (without -u) would work on the database applicable to the spamd user.

My spamd is currently running as root, but I am thinking about changing
it to run using Debian's pre-setup user (debian-spamd).  Unless you guys
have better recommendations.

> It all depends on whether you want your users to have individual databases
> tailored to their own spam/ham, or a global database.

The problem with having a user-tailored database is that I will have to
run sa-update for every user, right?  Currently, Debian provides the
aforementioned spamd user (debian-spamd) and runs sa-update on behalf of
it.  Therefore, I believe using a global database is probably better in
this case.  What's your opinion?

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Fri, November 8, 2013 3:24 pm, Karsten Bräckelmann wrote:
> The latter is incorrect -- spamc by default sends the effective user ID,
> and spamd switches users before processing the mail (assuming the daemon
> has been started as root). The -u user option is only necessary to
> change that default.

Whoops, you're perfectly right.  On a system where spamc is run as some
fixed user (e.g. nobody), you need the -u option to get the per-user
options to work correctly.  If spamc is being run as the receiving user
already (e.g. via procmail, barring some weird setuid behavior) then you
don't need the -u option (although it won't break anything if you use it,
it's just unnecessary).

Sorry for the incomplete info.

						--- Amir


Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2013-11-08 at 14:45 -0700, Amir 'CG' Caspi wrote:
> On Fri, November 8, 2013 2:39 pm, Sergio Durigan Junior wrote:
> > I don't think sa-learn can help with spamd.  Its own manpage mentions
> > that, for spamd users, "spamc -L" is the way to go.

Fundamentally, there is no difference between sa-learn and spamc -L.


> > Hm, really?  I thought spamd kept a global Bayes database, and that
> > everyone calling "spamc -L" would end up feeding this database, and not
> > some local one.
> 
> It depends on how spamc is called.  If spamd is running as root and spamc
> is called with the -u flag, then spamd will su to the named user, and will
> then use that user's local database (and local prefs, if allow_user_prefs
> is enabled).  spamc -L -u would work on the local database; spamc -L
> (without -u) would work on the database applicable to the spamd user.

The latter is incorrect -- spamc by default sends the effective user ID,
and spamd switches users before processing the mail (assuming the daemon
has been started as root). The -u user option is only necessary to
change that default.




Re: spamc -L apparently not working properly

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Fri, November 8, 2013 2:39 pm, Sergio Durigan Junior wrote:
> I don't think sa-learn can help with spamd.  Its own manpage mentions
> that, for spamd users, "spamc -L" is the way to go.
>
> Hm, really?  I thought spamd kept a global Bayes database, and that
> everyone calling "spamc -L" would end up feeding this database, and not
> some local one.

It depends on how spamc is called.  If spamd is running as root and spamc
is called with the -u flag, then spamd will su to the named user, and will
then use that user's local database (and local prefs, if allow_user_prefs
is enabled).  spamc -L -u would work on the local database; spamc -L
(without -u) would work on the database applicable to the spamd user.

It all depends on whether you want your users to have individual databases
tailored to their own spam/ham, or a global database.

						--- Amir


Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2013-11-08 at 20:18 -0200, Sergio Durigan Junior wrote:
> Nice, thank you.  I am more inclined to use a per-user database, and
> call "spamc -u myuser -L spam".  Let's see how that goes.

The real difference between sa-learn and spamc -L is how to feed it.

The spamc way expects a single message on STDIN, which usually is most
applicable for integration with your MUA. It also easily enables mail
storage and SA to be on different machines.

sa-learn expects the message(s) as file names. It requires direct access to
the mail storage, but enables training of entire mail folders with a
single command.
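
To make the contrast concrete, here is a hedged sketch of both feeding styles.
feed_each is a hypothetical helper that pipes every file in a directory to a
command on STDIN, one message per invocation, which is the shape spamc -L
expects. The Maildir path in the usage comments is an assumption.

```shell
#!/bin/sh
# feed_each DIR CMD [ARGS...]
# Pipe each regular file in DIR to CMD on STDIN, one message per invocation.
feed_each() {
    dir=$1
    shift
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue   # skip subdirectories and unmatched globs
        "$@" < "$msg" || echo "failed: $msg" >&2
    done
}

# Usage (paths are assumptions):
#   spamc style, one message per call on STDIN:
#     feed_each "$HOME/Maildir/.Spam/cur" spamc -L spam
#   sa-learn style, a whole folder by file name in one command:
#     sa-learn --spam "$HOME/Maildir/.Spam/cur"
```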




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, John Hardin wrote:

>> I don't think sa-learn can help with spamd.  Its own manpage mentions
>> that, for spamd users, "spamc -L" is the way to go.
>
> Not true. sa-learn is just fine for spamd with a global Bayes
> database, and it's recommended for administrative simplicity if you
> have that environment.

Aha, interesting, thanks for explaining.

> Global vs. per-user Bayes databases is a site-specific
> config. However, it should be consistent - spamd should be reading
> from and training to the bayes database of the user running spamc, so
> I don't off the top of my head know why it doesn't appear to be working
> for you.
>
> What are the Bayes database statistics before and after running spamc -L?
> (sa-learn --dump magic)
>
> I use a global database and sa-learn, so I don't have any direct
> experience with spamc -L quirks, sorry. That's why I suggested
> sa-learn.

Nice, thank you.  I am more inclined to use a per-user database, and
call "spamc -u myuser -L spam".  Let's see how that goes.

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by John Hardin <jh...@impsec.org>.
On Fri, 8 Nov 2013, Sergio Durigan Junior wrote:

> On Friday, November 08 2013, John Hardin wrote:
>
>> On Fri, 8 Nov 2013, Sergio Durigan Junior wrote:
>>
>>>  #> spamc -c < spam.file
>>>  0.0/5.0
>>>  #> spamc -L spam < spam.file
>>>  (successful message saying that the spam was learned)
>>>  #> spamc -c < spam.file
>>>  0.0/5.0
>>>
>>> I have already updated my Bayesian database, restarted the spamd
>>> service, etc.  I was expecting that I'd get a high rate after feeding
>>> the spam to SpamAssassin, but that's not happening.  Any suggestions?
>>
>> Try using sa-learn to train Bayes.
>
> I don't think sa-learn can help with spamd.  Its own manpage mentions
> that, for spamd users, "spamc -L" is the way to go.

Not true. sa-learn is just fine for spamd with a global Bayes database, 
and it's recommended for administrative simplicity if you have that 
environment.

>> The big thing to keep in mind is that the user running the training
>> needs to be the same user that spamd is running as; if not, depending
>> on your bayes database config, you may be training a different Bayes
>> database than the one spamd is reading.
>
> Hm, really?  I thought spamd kept a global Bayes database, and that
> everyone calling "spamc -L" would end up feeding this database, and not
> some local one.

Global vs. per-user Bayes databases is a site-specific config. However, it 
should be consistent - spamd should be reading from and training to the 
bayes database of the user running spamc, so I don't off the top of my 
head know why it doesn't appear to be working for you.

What are the Bayes database statistics before and after running spamc -L?
(sa-learn --dump magic)

I use a global database and sa-learn, so I don't have any direct 
experience with spamc -L quirks, sorry. That's why I suggested sa-learn.
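
A sketch of that before/after comparison. It assumes the
"sa-learn --dump magic" line layout in which the third column holds the count
and the last field names the counter (nspam, nham). bayes_count is a
hypothetical helper name.

```shell
#!/bin/sh
# bayes_count NAME
# Read `sa-learn --dump magic` output on STDIN and print the value of the
# given non-token counter (for example nspam or nham).
bayes_count() {
    awk -v name="$1" '$NF == name { print $3 }'
}

# Usage, run as the same user whose database spamd will read:
#   before=$(sa-learn --dump magic | bayes_count nspam)
#   spamc -L spam < spam.file
#   after=$(sa-learn --dump magic | bayes_count nspam)
#   echo "nspam: $before -> $after"
```

If nspam does not move, the training most likely landed in a different
database than the one being inspected.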

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   From the Liberty perspective, it doesn't matter if it's a
   jackboot or a Birkenstock smashing your face.         -- Robb Allen
-----------------------------------------------------------------------
  3 days until Veterans Day

Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, John Hardin wrote:

> On Fri, 8 Nov 2013, Sergio Durigan Junior wrote:
>
>>  #> spamc -c < spam.file
>>  0.0/5.0
>>  #> spamc -L spam < spam.file
>>  (successful message saying that the spam was learned)
>>  #> spamc -c < spam.file
>>  0.0/5.0
>>
>> I have already updated my Bayesian database, restarted the spamd
>> service, etc.  I was expecting that I'd get a high rate after feeding
>> the spam to SpamAssassin, but that's not happening.  Any suggestions?
>
> Try using sa-learn to train Bayes.

I don't think sa-learn can help with spamd.  Its own manpage mentions
that, for spamd users, "spamc -L" is the way to go.

> The big thing to keep in mind is that the user running the training
> needs to be the same user that spamd is running as; if not, depending
> on your bayes database config, you may be training a different Bayes
> database than the one spamd is reading.

Hm, really?  I thought spamd kept a global Bayes database, and that
everyone calling "spamc -L" would end up feeding this database, and not
some local one.

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by John Hardin <jh...@impsec.org>.
On Fri, 8 Nov 2013, Sergio Durigan Junior wrote:

>  #> spamc -c < spam.file
>  0.0/5.0
>  #> spamc -L spam < spam.file
>  (successful message saying that the spam was learned)
>  #> spamc -c < spam.file
>  0.0/5.0
>
> I have already updated my Bayesian database, restarted the spamd
> service, etc.  I was expecting that I'd get a high rate after feeding
> the spam to SpamAssassin, but that's not happening.  Any suggestions?

Try using sa-learn to train Bayes.

The big thing to keep in mind is that the user running the training needs 
to be the same user that spamd is running as; if not, depending on your 
bayes database config, you may be training a different Bayes database than 
the one spamd is reading.


Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, John Hardin wrote:

> Not directly addressing your other questions but: running spamassassin
> directly is only really suitable for *very* low-traffic environments,
> as that will parse and compile all of the rules and other config *per
> message*, which is a lot of overhead. spamc+spamd is strongly
> recommended for production use.

Thanks a lot for the input, John.  I guess I will end up using spamd and
spamc, after all.  I'll just wait for the answer to my question, and
then I'll set everything up here.

Regards,

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by John Hardin <jh...@impsec.org>.
On Fri, 8 Nov 2013, Sergio Durigan Junior wrote:

> I am using Debian Wheezy here (therefore, Exim + Dovecot for e-mail),
> and I am still deciding how to run SpamAssassin.  I am divided between
> running it by directly calling spamassassin, or by running spamd and
> calling spamc.  Both methods are going to be used via my .procmailrc.

Not directly addressing your other questions but: running spamassassin 
directly is only really suitable for *very* low-traffic environments, as 
that will parse and compile all of the rules and other config *per 
message*, which is a lot of overhead. spamc+spamd is strongly recommended 
for production use.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The more you believe you can create heaven on earth the more
   likely you are to set up guillotines in the public square to
   hasten the process.                                 -- James Lileks
-----------------------------------------------------------------------
  3 days until Veterans Day

Re: spamc -L apparently not working properly

Posted by John Hardin <jh...@impsec.org>.
On Sat, 9 Nov 2013, Sergio Durigan Junior wrote:

> [Note: By ham I assume you mean false-positives, and not just regular
> e-mail.]

No. Train with correctly classified ham as well.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...to announce there must be no criticism of the President or to
   stand by the President right or wrong is not only unpatriotic and
   servile, but is morally treasonous to the American public.
                                           -- Theodore Roosevelt, 1918
-----------------------------------------------------------------------
  3 days until Veterans Day

Re: spamc -L apparently not working properly

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Sat, 9 Nov 2013, Sergio Durigan Junior wrote:

> On Saturday, November 09 2013, Karsten Bräckelmann wrote:
>
>> You don't have any kind of archive of spam? If so, train on recent ones,
>> feel free to exceed the minimum limit, but don't bother too much with
>> old spam. It changes much faster over time than ham does.
>>
>> Also, at least until you have reached the minimum required training, do
>> train with identified spam, too. Same with ham. For now, keep training in
>> a ratio somewhere between 1:1 and your actual spam-to-ham ratio.
>
> [Note: By ham I assume you mean false-positives, and not just regular
> e-mail.]
>
> No, (un)fortunately I don't.  I've been running this server for 5 months
> now, and only received about 10 spams so far.  I decided to start
> running SA now because I've received 5 spams in the last 3 days, which
> triggered my internal alarm.
>
>> Do train. Spam, as well as ham. If you got some recent-ish archives.
>
> Will do.  However, I don't have false-positives (ham) to train.  As I
> said above, I only have about 10 spam messages, which I already used to
> train Bayes.  Not sure if it is possible/would be good to search for
> recent spam archives on the net.  I believe not...

For Bayes to work, it needs at least 200 examples of Ham (e-mail that
you want) and 200 examples of Spam (e-mail that you don't want).
It doesn't matter whether the rules-based SA engine classified the
messages correctly; what matters is what you consider Ham or Spam
(i.e., correctly classified by -you-).
In essence, you are "teaching" the Bayes system to recognize
your preferences in e-mail classification.

So the messages you've kept in your INBOX should be good for Ham.
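
The 200-message minimum can be checked mechanically. The sketch below assumes
the "sa-learn --dump magic" line layout in which the third column holds the
count and the last field is nham or nspam. bayes_ready is a hypothetical
helper name, and the default of 200 is the figure given above.

```shell
#!/bin/sh
# bayes_ready [MIN]
# Read `sa-learn --dump magic` output on STDIN, print both counters, and
# succeed only if ham and spam have each reached MIN (default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nham"  { nham  = $3 }
        $NF == "nspam" { nspam = $3 }
        END {
            printf "nham=%d nspam=%d\n", nham, nspam
            exit !(nham + 0 >= min + 0 && nspam + 0 >= min + 0)
        }'
}

# Usage:
#   sa-learn --dump magic | bayes_ready && echo "Bayes has enough training"
```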

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: Positive / Negative

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Monday, November 11 2013, Karsten Bräckelmann wrote:

> 'sa-learn --dump magic' still shows less than 200 nham / nspam, right?

Yes, it does.

> Until that issue is resolved, please keep the spam for potential further
> post-receiving tests.

Will certainly do.

> Not strictly SA configuration, but you probably want to change the
> following Debian defaults in /etc/default/spamassassin
>
>   ENABLED=0
>   CRON=0
>
> and enable the spamd daemon system-wide, as well as sa-update.
>
> If you didn't yet run sa-update, do so now. Restart spamd afterward.
> FWIW, this counts as "modifying SA config", since it updates the stock
> rule-set.

Oh, I did that, yeah.  I meant to say that I did not touch any file
under /etc/spamassassin.  So my /etc/spamassassin/local.cf, for example,
is exactly what is shipped with Debian.

Thanks,

-- 
Sergio

Re: Positive / Negative

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-11-11 at 00:34 -0200, Sergio Durigan Junior wrote:
> On Sunday, November 10 2013, Karsten Bräckelmann wrote:
> > Given you state below no spam has been identified yet, you're confusing
> > terms.

Gnah. I was falsely thinking "received" when I wrote "identified" there.

> I don't think I am confusing terms.

True, my bad, sorry.

> > If you prefer, refer to them as missed spam, or (in)correctly classified
> > ham and spam.
> 
> OK, I will make use of those terms if it makes things clearer for you.

That however should not be possible. I guess I am entirely capable of
handling the terms FP and FN... ;)


> Indeed, no spam has been classified at all since I started running SA.
> 
> An interesting fact is that, before I started using SA, I had some spams
> left in my INBOX.  Well, when I decided that it was time to use SA, I
> manually fed those spams to spamc (for testing purposes), and SA
> correctly identified almost all of them!  But now, as I said, SA is
> failing to classify the spam I've been receiving.

'sa-learn --dump magic' still shows less than 200 nham / nspam, right?

> > I suggest starting a new thread (not a reply) about this. For starters,
> > we'd need details about your environment and how you set up SA. Plus
> > some X-Spam-Status headers of ham and (missed) spam.
> 
> OK, fair enough.  Unfortunately, I don't have any spam messages left.  I
> used them all to feed sa-learn, and then deleted them.  But as soon as I
> get another misclassified spam, I will start another thread on this
> topic, with all the information requested

Until that issue is resolved, please keep the spam for potential further
post-receiving tests.


> (BTW, I am using a default Debian SA configuration, and did not modify
> anything so far).

Not strictly SA configuration, but you probably want to change the
following Debian defaults in /etc/default/spamassassin

  ENABLED=0
  CRON=0

and enable the spamd daemon system-wide, as well as sa-update.

If you didn't yet run sa-update, do so now. Restart spamd afterward.
FWIW, this counts as "modifying SA config", since it updates the stock
rule-set.
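
As a sketch, the two defaults can be flipped as follows. This assumes the
intended values are 1 for both settings. enable_sa_defaults is a hypothetical
helper that prints the edited file to STDOUT rather than modifying it in
place, so the result can be reviewed first.

```shell
#!/bin/sh
# enable_sa_defaults FILE
# Print FILE with ENABLED and CRON switched from 0 to 1; all other lines are
# passed through unchanged. The file itself is not modified.
enable_sa_defaults() {
    sed -e 's/^ENABLED=0$/ENABLED=1/' -e 's/^CRON=0$/CRON=1/' "$1"
}

# Usage (review the output before replacing the real file):
#   enable_sa_defaults /etc/default/spamassassin > /tmp/spamassassin.new
```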




Re: Positive / Negative

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Sunday, November 10 2013, Karsten Bräckelmann wrote:

> On Sun, 2013-11-10 at 03:32 -0200, Sergio Durigan Junior wrote:
>> On Sunday, November 10 2013, Karsten Bräckelmann wrote:
>
>> For all messages that I received since I started using SA (about 20
>> messages, of which 5 were false-negatives, and the rest were
>> true-negatives), [...]
>
> Given you state below no spam has been identified yet, you're confusing
> terms.
>
> SA tests for spam. Thus a positive result is "classified spam", and "not
> spam" is a negative test result. True means the result is correct,
> whereas false indicates a mis-classification by the test.
>
> False (mis-classified) negatives (rated not-spam) are spam that SA
> failed to classify as spam.

I don't think I am confusing terms.

false-negative: spam that got classified as ham
false-positive: ham that got classified as spam
true-negative: ham
true-positive: spam

Maybe my terms aren't the correct ones, and if that's the case, sorry
about it.
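
Spelled out as a lookup, with "positive" meaning "SA said spam", the four
terms come out like this. classify is a hypothetical helper written only to
illustrate the definitions above.

```shell
#!/bin/sh
# classify ACTUAL VERDICT
# ACTUAL is what the message really is, VERDICT is what SA reported;
# both are either "spam" or "ham".
classify() {
    case "$1/$2" in
        spam/spam) echo "true-positive" ;;
        ham/spam)  echo "false-positive" ;;
        ham/ham)   echo "true-negative" ;;
        spam/ham)  echo "false-negative" ;;
        *) echo "usage: classify spam|ham spam|ham" >&2; return 2 ;;
    esac
}

# Usage:
#   classify spam ham    # a spam message that SA rated as ham
```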

> If you prefer, refer to them as missed spam, or (in)correctly classified
> ham and spam.

OK, I will make use of those terms if it makes things clearer for you.

>> I do receive spam.  About 1 or 2 per day.  But so far SA hasn't been
>> able to catch any of them, and all spam I receive has been marked as ham
>> so far.  The message headers are OK, there is nothing apparently wrong
>> with SA, but it is just not catching most of my spam.  I assume this is
>> normal behavior since I just started using SA a few days ago.
>
> No, that is not normal. In fact, since no spam has been identified at
> all yet, there is something really broken or mis-configured.

Indeed, no spam has been classified at all since I started running SA.

An interesting fact is that, before I started using SA, I had some spams
left in my INBOX.  Well, when I decided that it was time to use SA, I
manually fed those spams to spamc (for testing purposes), and SA
correctly identified almost all of them!  But now, as I said, SA is
failing to classify the spam I've been receiving.

> I suggest starting a new thread (not a reply) about this. For starters,
> we'd need details about your environment and how you set up SA. Plus
> some X-Spam-Status headers of ham and (missed) spam.

OK, fair enough.  Unfortunately, I don't have any spam messages left.  I
used them all to feed sa-learn, and then deleted them.  But as soon as I
get another misclassified spam, I will start another thread on this
topic, with all the information requested (BTW, I am using a default
Debian SA configuration, and did not modify anything so far).

Thanks,

-- 
Sergio

Positive / Negative (was: Re: spamc -L apparently not working properly)

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2013-11-10 at 03:32 -0200, Sergio Durigan Junior wrote:
> On Sunday, November 10 2013, Karsten Bräckelmann wrote:

> For all messages that I received since I started using SA (about 20
> messages, of which 5 were false-negatives, and the rest were
> true-negatives), [...]

Given you state below no spam has been identified yet, you're confusing
terms.

SA tests for spam. Thus a positive result is "classified spam", and "not
spam" is a negative test result. True means the result is correct,
whereas false indicates a mis-classification by the test.

False (mis-classified) negatives (rated not-spam) are spam that SA
failed to classify as spam.


If you prefer, refer to them as missed spam, or (in)correctly classified
ham and spam.


> I do receive spam.  About 1 or 2 per day.  But so far SA hasn't been
> able to catch any of them, and all spam I receive has been marked as ham
> so far.  The message headers are OK, there is nothing apparently wrong
> with SA, but it is just not catching most of my spam.  I assume this is
> normal behavior since I just started using SA a few days ago.

No, that is not normal. In fact, since no spam has been identified at
all yet, there is something really broken or mis-configured.

I suggest starting a new thread (not a reply) about this. For starters,
we'd need details about your environment and how you set up SA. Plus
some X-Spam-Status headers of ham and (missed) spam.




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Sunday, November 10 2013, Karsten Bräckelmann wrote:

> nham is the "Number of HAM" learned, in messages. Same for nspam. Keep
> training until both are at least 200 -- accuracy should improve
> dramatically after that.

I figured that out.

> Keep an eye on the X-Spam-Status header, autolearn bit.
>
> If that happens frequently for FNs, there's a problem somewhere. We'd
> need the X-Spam headers and preferably the full, raw message put up on a
> pastebin for debugging. After some initial training.

For all messages that I received since I started using SA (about 20
messages, of which 5 were false-negatives, and the rest were
true-negatives), autolearn seems to be working OK, i.e., when messages
score below the threshold, autolearn works, and when messages score
above the threshold, I see "autolearn=no".

> There's one thing worrying in your comment: "whether false-negative or
> true-negative". You DO have spam also, right? I mean, classified spam is
> not just silently discarded without you ever seeing it? That would be
> really bad at this stage. Take it, verify it, learn it.

I do receive spam.  About 1 or 2 per day.  But so far SA hasn't been
able to catch any of them, and all spam I receive has been marked as ham
so far.  The message headers are OK, there is nothing apparently wrong
with SA, but it is just not catching most of my spam.  I assume this is
normal behavior since I just started using SA a few days ago.

For every spam message that I received, I analyze its headers, verify
that everything is OK with SA, and then feed it to sa-learn.

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2013-11-10 at 02:39 -0200, Sergio Durigan Junior wrote:
> On Sunday, November 10 2013, Karsten Bräckelmann wrote:

> > > So, I now have yet another question.  I let auto_learn active for SA,
> > > and now for every false-negative SA will learn that it is not spam,
> >
> > No. False negative (not classified spam, although it is) is NOT what
> > triggers auto-learn ham.
> 
> All right, I misunderstood things then.  I assumed that because of
> sa-learn --dump magic output:
>
>   0.000          0         37          0  non-token data: nham
>
> And this number increases every time I receive a message (whether it is
> a false-negative or a true-negative).  Since I have too little spam to
> train, it is hard to keep up with the number of ham received.

nham is the "Number of HAM" learned, in messages. Same for nspam. Keep
training until both are at least 200 -- accuracy should improve
dramatically after that.
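
A rough way to check those counters from the shell is sketched below. It assumes the `sa-learn --dump magic` column layout quoted in this thread, where the third column holds the count. `bayes_ready` is a hypothetical helper name, not part of SA.

```shell
# Hypothetical helper: reads `sa-learn --dump magic` output on stdin and
# reports whether both nspam and nham have reached a minimum (default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 + 0 }   # third column is the message count
        $NF == "nham"  { nham  = $3 + 0 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min + 0 && nham >= min + 0)   # exit 0 only if both reached min
        }
    '
}
```

Possible usage, run as the user who owns the Bayes database: `sa-learn --dump magic | bayes_ready && echo "enough training"`.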

Keep an eye on the X-Spam-Status header, autolearn bit.

If autolearn=ham shows up frequently for FNs, there's a problem
somewhere. We'd need the X-Spam headers and preferably the full, raw
message put up on a pastebin for debugging, after some initial training.


There's one thing worrying in your comment: "whether false-negative or
true-negative". You DO have spam also, right? I mean, classified spam is
not just silently discarded without you ever seeing it? That would be
really bad at this stage. Take it, verify it, learn it.




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Sunday, November 10 2013, Karsten Bräckelmann wrote:

> On Sun, 2013-11-10 at 01:59 -0200, Sergio Durigan Junior wrote:
>> Nice, thanks both of you for the answers.
>> 
>> I am now feeding SA with ham from my INBOX, while I also feed it with
>> false-negatives (interestingly, I am receiving now *much* more spam than
>> I was a week ago...).
>
> Given what you stated about your spam volume before, entirely possible.
> However, you're not using a catch-all address, are you?

No, I'm not.

>> So, I now have yet another question.  I let auto_learn active for SA,
>> and now for every false-negative SA will learn that it is not spam,
>
> No. False negative (not classified spam, although it is) is NOT what
> triggers auto-learn ham.

All right, I misunderstood things then.  I assumed that because of
sa-learn --dump magic output:

  ...
  0.000          0         37          0  non-token data: nham
  ...

And this number increases every time I receive a message (whether it is
a false-negative or a true-negative).  Since I have too little spam to
train, it is hard to keep up with the number of ham received.

But I will read the docs and learn how this works.

>> although it is.  I'm now thinking that maybe auto_learn is not a good
>> idea, at least until I have a good enough Bayes database (strangely, SA
>> did not catch *any* spam in the last 48 hours...).  Can you confirm
>> this?
>> 
>> Thanks a lot, and sorry if I'm asking too much :-).
>
> Just leave auto-learn enabled. And, yet again, do train both ham and
> spam (all, not only mis-classified messages) for initial training.

I am already doing that, thanks for the advice.

> Auto-learning in SA Bayes is much more than a pure feedback loop, as you
> described. A message just being classified ham (< 5.0) is NOT learned as
> ham. Neither are messages scored spam (>= 5.0) learned as spam.
>
> (1) The thresholds for auto-learning are 0.1 (ham) and 12.0 (spam) by
>     default, not the default required_score threshold of 5.0.
> (2) Certain rules are not considered for auto-learning, to prevent self-
>     feeding.
> (3) A minimum of header and body rules are required, to prevent biasing.
>
> See M::SA::Plugin::AutoLearnThreshold docs for more details.
>
> A field near the end of the X-Spam-Status header tells you whether SA
> auto-learned the message. Hardly surprising, that's
>   autolearn=(ham|spam|no|unavailable)

Great, thanks a lot for the pointers and the explanation.

> In your case, I'd say just let SA do its job. Monitor the results, and
> train both ham and spam, at the very least until BAYES_xx rules show up
> in X-Spam-Status headers.
>
> Keep training Bayes after that, to improve performance. Definitely do
> train on false positives and negatives.
>
> Wait, observe, and learn how to read X-Spam headers. :)

Nice, I will keep monitoring everything the way I'm doing.  And I will
definitely read more about the headers and SA in general.

Thanks a lot for the replies and the patience.  It's been very
educational :-).

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2013-11-10 at 01:59 -0200, Sergio Durigan Junior wrote:
> Nice, thanks both of you for the answers.
> 
> I am now feeding SA with ham from my INBOX, while I also feed it with
> false-negatives (interestingly, I am receiving now *much* more spam than
> I was a week ago...).

Given what you stated about your spam volume before, entirely possible.
However, you're not using a catch-all address, are you?

> So, I now have yet another question.  I let auto_learn active for SA,
> and now for every false-negative SA will learn that it is not spam,

No. False negative (not classified spam, although it is) is NOT what
triggers auto-learn ham.

> although it is.  I'm now thinking that maybe auto_learn is not a good
> idea, at least until I have a good enough Bayes database (strangely, SA
> did not catch *any* spam in the last 48 hours...).  Can you confirm
> this?
> 
> Thanks a lot, and sorry if I'm asking too much :-).

Just leave auto-learn enabled. And, yet again, do train both ham and
spam (all, not only mis-classified messages) for initial training.


Auto-learning in SA Bayes is much more than a pure feedback loop, as you
described. A message just being classified ham (< 5.0) is NOT learned as
ham. Neither are messages scored spam (>= 5.0) learned as spam.

(1) The thresholds for auto-learning are 0.1 (ham) and 12.0 (spam) by
    default, not the default required_score threshold of 5.0.
(2) Certain rules are not considered for auto-learning, to prevent self-
    feeding.
(3) A minimum of header and body rules are required, to prevent biasing.

See M::SA::Plugin::AutoLearnThreshold docs for more details.

A field near the end of the X-Spam-Status header tells you whether SA
auto-learned the message. Hardly surprising, that's
  autolearn=(ham|spam|no|unavailable)
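
To pull that field out of a saved message, something along these lines might do. It is only a sketch: `autolearn_status` is a made-up helper name, and it assumes the message has ordinary headers terminated by a blank line, with the X-Spam-Status header possibly folded over several lines.

```shell
# Hypothetical helper: print the autolearn value from the X-Spam-Status
# header of a message read on stdin (folded header lines are joined first).
autolearn_status() {
    awk '
        /^$/     { exit }                            # blank line: end of headers
        /^[ \t]/ { if (inhdr) buf = buf $0; next }   # continuation of the previous header
                 { inhdr = (tolower($0) ~ /^x-spam-status:/); if (inhdr) buf = buf $0 }
        END      { if (match(buf, /autolearn=[a-z]+/)) print substr(buf, RSTART + 10, RLENGTH - 10) }
    '
}
```

Possible usage: `autolearn_status < some-message.eml`, which prints the bare value (for example `ham` or `no`), or nothing if the header is absent.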


In your case, I'd say just let SA do its job. Monitor the results, and
train both ham and spam, at the very least until BAYES_xx rules show up
in X-Spam-Status headers.

Keep training Bayes after that, to improve performance. Definitely do
train on false positives and negatives.

Wait, observe, and learn how to read X-Spam headers. :)




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Saturday, November 09 2013, Karsten Bräckelmann wrote:

> Ham is good mail, messages you want (or actually subscribed to),
> messages sent to you with your consent. Spam is junk, unsolicited mail
> sent to you without your consent. Regardless of SA classification or
> score.
>
> False positives and negatives are messages mis-classified by SA.

On Saturday, November 09 2013, David B. Funk wrote:

> For Bayes to work it needs at least 200 examples of Ham (e-mail that
> you want) and 200 examples of Spam (e-mail that you don't want).
> It doesn't matter if the messages were correctly or not correctly
> classified by the rules-based SA engine, just what you consider
> Ham/Spam (IE correctly classified by -you-).
> In essence you are "teaching" the Bayes system how to recognize
> your preferences in e-mail classifying.
>
> So the messages you've kept in your INBOX should be good for Ham.

Nice, thanks both of you for the answers.

I am now feeding SA with ham from my INBOX, while I also feed it with
false-negatives (interestingly, I am receiving now *much* more spam than
I was a week ago...).

So, I now have yet another question.  I left auto_learn enabled for SA,
and now for every false-negative SA will learn that it is not spam,
although it is.  I'm now thinking that maybe auto_learn is not a good
idea, at least until I have a good enough Bayes database (strangely, SA
did not catch *any* spam in the last 48 hours...).  Can you confirm
this?

Thanks a lot, and sorry if I'm asking too much :-).

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sat, 2013-11-09 at 03:33 -0200, Sergio Durigan Junior wrote:
> On Saturday, November 09 2013, Karsten Bräckelmann wrote:
> 
> > You don't have any kind of archive of spam? If so, train on recent ones,
> > feel free to exceed the minimum limit, but don't bother too much with
> > old spam. It changes much faster over time than ham does.
> >
> > Also, at least until you have reached the minimum required training, do
> > train with identified spam, too. Same with ham. For now, keep training
> > in a ratio somewhere between 1:1 and your actual spam-to-ham ratio.
> 
> [Note: By ham I assume you mean false-positives, and not just regular
> e-mail.]

You're assuming wrong.

Ham is good mail, messages you want (or actually subscribed to),
messages sent to you with your consent. Spam is junk, unsolicited mail
sent to you without your consent. Regardless of SA classification or
score.

False positives and negatives are messages mis-classified by SA.




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Saturday, November 09 2013, Karsten Bräckelmann wrote:

> You don't have any kind of archive of spam? If so, train on recent ones,
> feel free to exceed the minimum limit, but don't bother too much with
> old spam. It changes much faster over time than ham does.
>
> Also, at least until you have reached the minimum required training, do
> train with identified spam, too. Same with ham. For now, keep training
> in a ratio somewhere between 1:1 and your actual spam-to-ham ratio.

[Note: By ham I assume you mean false-positives, and not just regular
e-mail.]

No, (un)fortunately I don't.  I've been running this server for 5 months
now, and only received about 10 spams so far.  I decided to start
running SA now because I've received 5 spams in the last 3 days, which
triggered my internal alarm.

> Do train. Spam, as well as ham. If you got some recent-ish archives.

Will do.  However, I don't have false-positives (ham) to train.  As I
said above, I only have about 10 spam messages, which I already used to
train Bayes.  Not sure if it is possible/would be good to search for
recent spam archives on the net.  I believe not...

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sat, 2013-11-09 at 01:34 -0200, Sergio Durigan Junior wrote:
> On Friday, November 08 2013, Karsten Bräckelmann wrote:

> > You mentioned that's a fresh install, actually not even in production
> > yet. The Bayes sub-system requires some training (minimum of 200 ham and
> > spam each) by default, before Bayes rules kick in for scanning.
> >
> > Instead of -c check only, use the -R option to print the report. You'll
> > notice there is no BAYES_xx rule (yet).
> 
> Thanks.  I had used -R before, without much success.  But yeah, I found
> some discussions on this list about Bayes databases, and people saying
> that at least 200 messages are needed before Bayes can start doing its
> job.
> 
> BTW, one spam has just sneaked in right now.  On the one hand I'm sad
> because of those false-negatives, but OTOH I'm happy because I'll be
> able to train the database faster :-).

You don't have any kind of archive of spam? If so, train on recent ones,
feel free to exceed the minimum limit, but don't bother too much with
old spam. It changes much faster over time than ham does.

Also, at least until you have reached the minimum required training, do
train with identified spam, too. Same with ham. For now, keep training
in a ratio somewhere between 1:1 and your actual spam-to-ham ratio.
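
A minimal sketch of such a training run, assuming the spam and ham have been sorted by hand into two directories of message files. `train_bayes` and the `SA_LEARN` override are illustrative names only; `--spam` and `--ham` are the usual sa-learn options.

```shell
# Hypothetical wrapper around sa-learn: train on one directory of spam and
# one of ham. SA_LEARN can be overridden (e.g. SA_LEARN=echo for a dry run).
train_bayes() {
    spam_dir=$1
    ham_dir=$2
    "${SA_LEARN:-sa-learn}" --spam "$spam_dir" || return 1
    "${SA_LEARN:-sa-learn}" --ham "$ham_dir"
}
```

Possible usage: `train_bayes ~/Maildir/.Spam/cur ~/Maildir/cur` (both paths are placeholders), run as the user whose Bayes database should be trained.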


> > > service, etc.  I was expecting that I'd get a high rate after feeding
> > > the spam to SpamAssassin, but that's not happening.  Any suggestions?
> >
> > In addition to required initial training:
> >
> > The Bayesian classifier works on a per-token (think: word) basis. Thus,
> > depending on the tokens in the message and existing ones in the db, the
> > impact of learning can vary quite a lot -- from hardly noticeable to
> > clear detection.
> 
> All right.  Since I don't have a good database yet (only 4 or 5 spams
> learned), I won't worry about it for now.  Let's see when I have a
> bigger DB...

Do train. Spam, as well as ham, if you have some recent-ish archives.


> Thanks a lot,

You're welcome. :)




Re: spamc -L apparently not working properly

Posted by Sergio Durigan Junior <se...@sergiodj.net>.
On Friday, November 08 2013, Karsten Bräckelmann wrote:

> On Fri, 2013-11-08 at 16:09 -0200, Sergio Durigan Junior wrote:
>>   #> spamc -c < spam.file
>>   0.0/5.0
>>   #> spamc -L spam < spam.file
>>   (successful message saying that the spam was learned)
>>   #> spamc -c < spam.file
>>   0.0/5.0
>
> You mentioned that's a fresh install, actually not even in production
> yet. The Bayes sub-system requires some training (minimum of 200 ham and
> spam each) by default, before Bayes rules kick in for scanning.
>
> Instead of -c check only, use the -R option to print the report. You'll
> notice there is no BAYES_xx rule (yet).

Thanks.  I had used -R before, without much success.  But yeah, I found
some discussions on this list about Bayes databases, and people saying
that at least 200 messages are needed before Bayes can start doing its
job.

BTW, one spam has just sneaked in right now.  On the one hand I'm sad
because of those false-negatives, but OTOH I'm happy because I'll be
able to train the database faster :-).

>> I have already updated my Bayesian database, restarted the spamd
>
> I'm curious -- what does updating your Bayes db mean?

Oh, I only meant that I ran "sa-learn" or "spamc -L".  Sorry if that is
the wrong terminology.

>> service, etc.  I was expecting that I'd get a high rate after feeding
>> the spam to SpamAssassin, but that's not happening.  Any suggestions?
>
> In addition to required initial training:
>
> The Bayesian classifier works on a per-token (think: word) basis. Thus,
> depending on the tokens in the message and existing ones in the db, the
> impact of learning can vary quite a lot -- from hardly noticeable to
> clear detection.

All right.  Since I don't have a good database yet (only 4 or 5 spams
learned), I won't worry about it for now.  Let's see when I have a
bigger DB...

Thanks a lot,

-- 
Sergio

Re: spamc -L apparently not working properly

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2013-11-08 at 16:09 -0200, Sergio Durigan Junior wrote:
>   #> spamc -c < spam.file
>   0.0/5.0
>   #> spamc -L spam < spam.file
>   (successful message saying that the spam was learned)
>   #> spamc -c < spam.file
>   0.0/5.0

You mentioned that's a fresh install, actually not even in production
yet. The Bayes sub-system requires some training (minimum of 200 ham and
spam each) by default, before Bayes rules kick in for scanning.

Instead of -c check only, use the -R option to print the report. You'll
notice there is no BAYES_xx rule (yet).
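
One way to script that check is sketched below. `has_bayes_rule` is a made-up helper name; it only looks for a rule name of the form BAYES_ followed by two or three digits anywhere in the report text.

```shell
# Hypothetical helper: succeed if the report text on stdin mentions a BAYES_xx rule.
has_bayes_rule() {
    grep -Eq 'BAYES_[0-9]{2,3}'
}
```

Possible usage: `spamc -R < spam.file | has_bayes_rule || echo "no BAYES_xx rule yet"`.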


> I have already updated my Bayesian database, restarted the spamd

I'm curious -- what does updating your Bayes db mean?

> service, etc.  I was expecting that I'd get a high rate after feeding
> the spam to SpamAssassin, but that's not happening.  Any suggestions?

In addition to required initial training:

The Bayesian classifier works on a per-token (think: word) basis. Thus,
depending on the tokens in the message and existing ones in the db, the
impact of learning can vary quite a lot -- from hardly noticeable to
clear detection.
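
To illustrate why the impact varies: a learned spam only helps with later messages that share tokens with it. The sketch below is a toy, not SA's real tokenizer (which also looks at headers and more). It simply lists the lowercase words two message files have in common, and `tokens` and `shared_tokens` are hypothetical names.

```shell
# Toy tokenizer (NOT SpamAssassin's real one): lowercase alphanumeric words,
# one per line, deduplicated.
tokens() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# Print the tokens that two message files have in common. Each list is
# already deduplicated, so any token seen twice must occur in both files.
shared_tokens() {
    { tokens < "$1"; tokens < "$2"; } | sort | uniq -d
}
```

Possible usage: `shared_tokens learned-spam.eml new-message.eml`. A short or empty list suggests that learning the first message will barely move the score of the second.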

