Posted to users@spamassassin.apache.org by fr...@ofb.net on 2015/10/18 06:35:56 UTC

newbie questions: sought, sa-learn, rule weights

Hello Users!

Apologies for asking multiple questions, I've just been reading
https://wiki.apache.org/spamassassin/ and have some things I wanted to
ask.

I'm getting a lot of spam, perhaps 25 messages/day, and about half of
it gets through SpamAssassin. I'm trying to figure out how to fix the
situation.

I tried using the "sought" ruleset following instructions from
http://taint.org/2007/08/15/004348a.html, but didn't see much
difference.

I'm concerned that the BAYES_* rules aren't showing up in my spam
headers, and would like to know if there's a good way to look at the
tokens in the database. When I do "sa-learn --dump data", I see a file
with lines like this:

0.987          1          0 1436496897  0315e1da7f
0.016          0          1 1410284743  0320ba06ef
0.987          1          0 1393199297  0329ec4e6e
0.003          0          5 1268403253  03541effbc
0.008          0          2 1398222936  038d6e997d
0.016          0          1 1429567309  041cabf4ef
0.016          0          1 1431638107  041d441c1b

Is that normal? How do I get at the actual tokens? How do I see how it
scores a test message, just the Bayesian part? I find that I get a lot
of spam with exactly the same lines in the body of the message, and
the Bayesian classifier doesn't seem to register it.

Here's the output of sa-learn --dump magic:

0.000          0          3          0  non-token data: bayes db version
0.000          0      15466          0  non-token data: nspam
0.000          0      30317          0  non-token data: nham
0.000          0    1733267          0  non-token data: ntokens
0.000          0 1098575745          0  non-token data: oldest atime
0.000          0 1441160002          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1441160455          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

I couldn't find a sample output on your Wiki, with which to compare
this; I'm worried about the 0.000 lines and other zeroes.

I'm also thinking that I should employ some kind of sender address
whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
receiving for the first time from a particular sender, and there are a
lot of strings that I can say for sure I'd never find in a Subject
line of a message from a friend who is emailing me for the first time:
"ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself; is
there a way to get SpamAssassin to notice when this comes in and
automatically whitelist the recipients for me?

Relatedly, if I create rules for e.g. ATTN, "stock tip", then I'd also
like to generate my own rule weights using my own spam/ham corpora.
Does it still take a week to do? Why did SpamAssassin go back to using
a GA for this process? Aren't there some much faster algorithms
around?

Thank you in advance,

Frederick


Re: newbie questions: sought, sa-learn, rule weights

Posted by fr...@ofb.net.
Hi Reindl,

Thanks for your reply.

I replied separately to John about my Bayes setup - you were right,
wrong user.

Thanks for the advice about whitelisting being unnecessary. I hope
that getting the Bayesian part working will make my setup effective
without this.

Thanks,

Frederick

On Sun, Oct 18, 2015 at 11:29:16AM +0200, Reindl Harald wrote:
> 
> 
> Am 18.10.2015 um 06:35 schrieb frederik@ofb.net:
> >I'm concerned that the BAYES_* rules aren't showing up in my spam
> >headers
> 
> you are most likely training a different Bayes database than the one
> used by the user SA is running as
> 
> >and would like to know if there's a good way to look at the
> >tokens in the database
> 
> there is no way at all, the database only holds stripped hashes
> 
> >When I do "sa-learn --dump data", I see a file
> >with lines like this:
> >
> >0.987          1          0 1436496897  0315e1da7f
> >0.016          0          1 1410284743  0320ba06ef
> >0.987          1          0 1393199297  0329ec4e6e
> >0.003          0          5 1268403253  03541effbc
> >0.008          0          2 1398222936  038d6e997d
> >0.016          0          1 1429567309  041cabf4ef
> >0.016          0          1 1431638107  041d441c1b
> >
> >Is that normal?
> 
> yes
> 
> >How do I get at the actual tokens?
> 
> you don't
> 
> >How do I see how it scores a test message, just the Bayesian part?
> 
> you see BAYES_00 - BAYES_999 in the mail headers
> 
> >I find that I get a lot
> >of spam with exactly the same lines in the body of the message, and
> >the Bayesian classifier doesn't seem to register it.
> 
> as said above: you are training the wrong Bayes database
> 
> >Here's the output of sa-learn --dump magic:
> >
> >0.000          0          3          0  non-token data: bayes db version
> >0.000          0      15466          0  non-token data: nspam
> >0.000          0      30317          0  non-token data: nham
> >0.000          0    1733267          0  non-token data: ntokens
> >0.000          0 1098575745          0  non-token data: oldest atime
> >0.000          0 1441160002          0  non-token data: newest atime
> >0.000          0          0          0  non-token data: last journal sync atime
> >0.000          0 1441160455          0  non-token data: last expiry atime
> >0.000          0          0          0  non-token data: last expire atime delta
> >0.000          0          0          0  non-token data: last expire reduction count
> 
> FROM WHAT USER?
> 
> >I couldn't find a sample output on your Wiki, with which to compare
> >this; I'm worried about the 0.000 lines and other zeroes.
> 
> they are normal
> 
> >I'm also thinking that I should employ some kind of sender address
> >whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
> >receiving for the first time from a particular sender, and there are a
> >lot of strings that I can say for sure I'd never find in a Subject
> >line of a message from a friend who is emailing me for the first time:
> >"ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself; is
> >there a way to get SpamAssassin to notice when this comes in and
> >automatically whitelist the recipients for me?
> 
> no need to do so, and for sure you don't want it automatically; you *think*
> you want it. Blind whitelisting is easy to trick with forged senders;
> whitelist_auth is based on the presence of DKIM/SPF
> 
> but typically you don't need much whitelisting unless you are a hosting
> provider and care about your load (combining whitelist_auth and shortcircuit)
> 



Re: newbie questions: sought, sa-learn, rule weights

Posted by Reindl Harald <h....@thelounge.net>.

Am 18.10.2015 um 06:35 schrieb frederik@ofb.net:
> I'm concerned that the BAYES_* rules aren't showing up in my spam
> headers

you are most likely training a different Bayes database than the one
used by the user SA is running as

> and would like to know if there's a good way to look at the
> tokens in the database

there is no way at all, the database only holds stripped hashes

> When I do "sa-learn --dump data", I see a file
> with lines like this:
>
> 0.987          1          0 1436496897  0315e1da7f
> 0.016          0          1 1410284743  0320ba06ef
> 0.987          1          0 1393199297  0329ec4e6e
> 0.003          0          5 1268403253  03541effbc
> 0.008          0          2 1398222936  038d6e997d
> 0.016          0          1 1429567309  041cabf4ef
> 0.016          0          1 1431638107  041d441c1b
>
> Is that normal?

yes

> How do I get at the actual tokens?

you don't

> How do I see how it scores a test message, just the Bayesian part?

you see BAYES_00 - BAYES_999 in the mail headers

> I find that I get a lot
> of spam with exactly the same lines in the body of the message, and
> the Bayesian classifier doesn't seem to register it.

as said above: you are training the wrong Bayes database

> Here's the output of sa-learn --dump magic:
>
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0      15466          0  non-token data: nspam
> 0.000          0      30317          0  non-token data: nham
> 0.000          0    1733267          0  non-token data: ntokens
> 0.000          0 1098575745          0  non-token data: oldest atime
> 0.000          0 1441160002          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1441160455          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire atime delta
> 0.000          0          0          0  non-token data: last expire reduction count

FROM WHAT USER?

> I couldn't find a sample output on your Wiki, with which to compare
> this; I'm worried about the 0.000 lines and other zeroes.

they are normal
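
For anyone comparing their own output, here is a minimal Python sketch
(not part of SpamAssassin) that reads the "--dump magic" lines quoted
above. It assumes that on these lines the value sits in the third
column, that the label follows "non-token data:", and that the other
numeric columns are placeholders, which would explain the 0.000 and 0
entries. The name parse_magic is made up for illustration.

```python
from datetime import datetime, timezone

# Lines copied from the "sa-learn --dump magic" output quoted above.
MAGIC_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0      15466          0  non-token data: nspam
0.000          0      30317          0  non-token data: nham
0.000          0    1733267          0  non-token data: ntokens
0.000          0 1098575745          0  non-token data: oldest atime
0.000          0 1441160002          0  non-token data: newest atime
"""


def parse_magic(dump_text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in dump_text.splitlines():
        fields, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line, e.g. a real token entry
        columns = fields.split()
        if len(columns) < 3:
            continue  # malformed line, skip it
        values[label.strip()] = int(columns[2])
    return values


magic = parse_magic(MAGIC_DUMP)
print("spam messages learned:", magic["nspam"])
print("ham messages learned: ", magic["nham"])
for key in ("oldest atime", "newest atime"):
    # The atime values look like Unix timestamps; show them as UTC dates.
    stamp = datetime.fromtimestamp(magic[key], tz=timezone.utc)
    print(key, "=", stamp.date().isoformat())
```

Read this way, the database above holds 15466 learned spam and 30317
learned ham messages, so the zeroes in the other columns are not a sign
of an empty database.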

> I'm also thinking that I should employ some kind of sender address
> whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
> receiving for the first time from a particular sender, and there are a
> lot of strings that I can say for sure I'd never find in a Subject
> line of a message from a friend who is emailing me for the first time:
> "ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself; is
> there a way to get SpamAssassin to notice when this comes in and
> automatically whitelist the recipients for me?

no need to do so, and for sure you don't want it automatically; you
*think* you want it. Blind whitelisting is easy to trick with
forged senders; whitelist_auth is based on the presence of DKIM/SPF

but typically you don't need much whitelisting unless you are a hosting
provider and care about your load (combining whitelist_auth and shortcircuit)


Re: newbie questions: sought, sa-learn, rule weights

Posted by RW <rw...@googlemail.com>.
On Mon, 19 Oct 2015 10:57:43 -0700
frederik@ofb.net wrote:


> I guess I need to use "spamc -L" rather than "sa-learn"? I tried
> "spamc -L" but it seems rather slow, about two messages per second,
> only slightly faster when the messages have already been seen. Is
> "sa-learn" faster than "spamc -L"? It seems to do closer to 8 messages
> per second, although they were all "seen" messages.
> 
> Perhaps I should just run spamd as my user, rather than user spamd?
> It's a single-user system... Or if it would be easier to point the
> global spamd to ~/.spamassassin/, but that seems messy...

run sa-learn as the user spamd using su
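
A rough Python sketch of what that could look like, for illustration
only. It merely builds the command line and does not run it. The
assumptions are labeled: the mail directory path is hypothetical, the
"-s /bin/sh" option is there in case the spamd account has no login
shell, and the command would have to be started as root. The helper
name sa_learn_as is invented for this example.

```python
import shlex


def sa_learn_as(user, kind, path, shell="/bin/sh"):
    """Build (but do not run) an su command that trains Bayes as `user`."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # sa-learn takes --spam or --ham followed by a file or directory.
    inner = "sa-learn --{} {}".format(kind, shlex.quote(path))
    return ["su", "-s", shell, user, "-c", inner]


# Hypothetical location of already sorted spam; adjust to the real one.
argv = sa_learn_as("spamd", "spam", "/home/frederick/Maildir/.Spam/cur")
print(" ".join(shlex.quote(part) for part in argv))
```

The messages must be readable by the spamd user for this to work, and
"sa-learn --dump magic" run the same way should then show the counts
that spamd actually sees.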

Re: newbie questions: sought, sa-learn, rule weights

Posted by fr...@ofb.net.
Hi John,

Thanks for your reply. Too bad the 'sought' rules are not working
anymore.

> You have plenty of tokens, so it's likely you're training Bayes as a
> different user than SA is running under, and you don't have a site-wide
> user-independent Bayes configured.

Yeah, looks like you're correct. I'm running spamd as root, with a
systemd unit:

ExecStart=/usr/bin/vendor_perl/spamd --allow-tell -x -u spamd -g spamd

I guess I need to use "spamc -L" rather than "sa-learn"? I tried
"spamc -L" but it seems rather slow, about two messages per second,
only slightly faster when the messages have already been seen. Is
"sa-learn" faster than "spamc -L"? It seems to do closer to 8 messages
per second, although they were all "seen" messages.

Perhaps I should just run spamd as my user, rather than user spamd?
It's a single-user system... Or if it would be easier to point the
global spamd to ~/.spamassassin/, but that seems messy...



> Care to post the rule hits for some of the FNs? That should be in their
> headers. That might let us provide more specific advice, for instance: are
> you hitting URIBL_BLOCKED?

Advice would be great...

From: info@freedesign.be
Subject: Sta niet in je blootje
X-Spam-Status: No, score=4.1 required=5.0 tests=DC_IMAGE_SPAM_TEXT,DKIM_SIGNED,
        DKIM_VALID,DKIM_VALID_AU,HTML_IMAGE_RATIO_02,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,
        SPF_PASS,T_SPF_HELO_TEMPERROR,UNWANTED_LANGUAGE_BODY autolearn=no
        autolearn_force=no version=3.4.1

From: "John L. Cheeseman" <jo...@taiwantee.com>
Subject: How making $1250 a day is Possible
X-Spam-Status: No, score=1.7 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,
        T_SPF_HELO_TEMPERROR,URIBL_BLACK autolearn=no autolearn_force=no
        version=3.4.1
                
From: "Bruce A. Stone" <br...@acnefixes.com>
Subject: This Could Change Everything in Bed
X-Spam-Status: No, score=1.0 required=5.0 tests=BODY_URI_ONLY,
        T_RP_MATCHES_RCVD,T_SPF_HELO_TEMPERROR,T_SPF_TEMPERROR autolearn=no
        autolearn_force=no version=3.4.1

From: "Anne M. Henderson" <an...@theblogcloud.com>
Subject: Want to Drop 3 Dress Sizes This Week?
X-Spam-Status: No, score=2.5 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,
        T_SPF_HELO_TEMPERROR,URIBL_DBL_SPAM autolearn=no autolearn_force=no
        version=3.4.1
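
A quick way to check headers like these for Bayes hits is sketched
below in Python. This is only an illustration, not a SpamAssassin
tool: it assumes the rule names follow "tests=" as a comma-separated
list that may be folded across lines, and the function names are made
up. The sample header is the second one from the list above.

```python
import re

HEADER = (
    "X-Spam-Status: No, score=1.7 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,\n"
    "        T_SPF_HELO_TEMPERROR,URIBL_BLACK autolearn=no autolearn_force=no\n"
    "        version=3.4.1\n"
)


def rules_hit(header):
    """Return the rule names listed after tests= in an X-Spam-Status header."""
    # ",\s*" lets the list continue across folded (indented) header lines.
    match = re.search(r"tests=([A-Za-z0-9_]+(?:,\s*[A-Za-z0-9_]+)*)", header)
    if match is None:
        return []
    names = [name.strip() for name in match.group(1).split(",")]
    return [] if names == ["none"] else names


def bayes_rules(header):
    """Return only the BAYES_* rules that fired, if any."""
    return [name for name in rules_hit(header) if name.startswith("BAYES_")]


print(rules_hit(HEADER))
print("Bayes rules:", bayes_rules(HEADER) or "none")
```

None of the four headers above lists a BAYES_* rule, which fits the
suspicion that the scanner is not using the database I trained.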


Let me know if any of the stock tips work out :)

Thanks,

Frederick

On Sun, Oct 18, 2015 at 12:29:19PM -0700, John Hardin wrote:
> On Sat, 17 Oct 2015, frederik@ofb.net wrote:
> 
> >I'm getting a lot of spam, perhaps 25 messages/day, and about half of
> >it gets through SpamAssassin. I'm trying to figure out how to fix the
> >situation.
> 
> Care to post the rule hits for some of the FNs? That should be in their
> headers. That might let us provide more specific advice, for instance: are
> you hitting URIBL_BLOCKED?
> 
> >I tried using the "sought" ruleset following instructions from
> >http://taint.org/2007/08/15/004348a.html, but didn't see much
> >difference.
> 
> Sadly that's gone stale and may not help much with current spam. The last
> time I saw an update was March 2014.
> 
> >I'm concerned that the BAYES_* rules aren't showing up in my spam
> >headers
> 
> The two most common causes for that are insufficient tokens learned and
> learning under the wrong user.
> 
> >Here's the output of sa-learn --dump magic:
> >
> >0.000          0      15466          0  non-token data: nspam
> >0.000          0      30317          0  non-token data: nham
> 
> You have plenty of tokens, so it's likely you're training Bayes as a
> different user than SA is running under, and you don't have a site-wide
> user-independent Bayes configured.
> 
> >Relatedly, if I create rules for e.g. ATTN, "stock tip",
> 
> Funny you should mention that particular one. I just noticed it had popped
> up to the top of the masscheck corpora hits, and I've pushed a scored rule
> for it. Hopefully that will start getting points tomorrow.
> 
> 
> -- 
>  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>   I'll have that son of a bitch eating out of dumpsters in less than
>   two years.       -- MS CEO Steve Ballmer, on RedHat CEO Matt Szulik
> -----------------------------------------------------------------------
> 

Re: newbie questions: sought, sa-learn, rule weights

Posted by John Hardin <jh...@impsec.org>.
On Sat, 17 Oct 2015, frederik@ofb.net wrote:

> I'm getting a lot of spam, perhaps 25 messages/day, and about half of
> it gets through SpamAssassin. I'm trying to figure out how to fix the
> situation.

Care to post the rule hits for some of the FNs (false negatives)? That
should be in their headers. That might let us provide more specific
advice, for instance: are you hitting URIBL_BLOCKED?

> I tried using the "sought" ruleset following instructions from
> http://taint.org/2007/08/15/004348a.html, but didn't see much
> difference.

Sadly that's gone stale and may not help much with current spam. The last 
time I saw an update was March 2014.

> I'm concerned that the BAYES_* rules aren't showing up in my spam
> headers

The two most common causes for that are insufficient tokens learned and
learning under the wrong user.

> Here's the output of sa-learn --dump magic:
>
> 0.000          0      15466          0  non-token data: nspam
> 0.000          0      30317          0  non-token data: nham

You have plenty of tokens, so it's likely you're training Bayes as a 
different user than SA is running under, and you don't have a site-wide 
user-independent Bayes configured.
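
As a rough illustration of the first cause, a small Python sketch of
the "enough training" check. It assumes the thresholds are 200 spam
and 200 ham messages, which to my knowledge are the defaults of the
bayes_min_spam_num and bayes_min_ham_num settings; check your own
configuration if you have changed them. The function name is invented
for this example.

```python
# Assumed defaults of bayes_min_spam_num and bayes_min_ham_num.
BAYES_MIN_SPAM = 200
BAYES_MIN_HAM = 200


def bayes_has_enough_training(nspam, nham,
                              min_spam=BAYES_MIN_SPAM, min_ham=BAYES_MIN_HAM):
    """True when both learned-message counts reach the minimums."""
    return nspam >= min_spam and nham >= min_ham


# nspam and nham as reported in the "--dump magic" output quoted above.
print(bayes_has_enough_training(15466, 30317))
```

With 15466 spam and 30317 ham learned, the quoted database is far above
those minimums, which is why the wrong-user explanation is the more
likely one.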

> Relatedly, if I create rules for e.g. ATTN, "stock tip",

Funny you should mention that particular one. I just noticed it had popped 
up to the top of the masscheck corpora hits, and I've pushed a scored 
rule for it. Hopefully that will start getting points tomorrow.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   I'll have that son of a bitch eating out of dumpsters in less than
   two years.       -- MS CEO Steve Ballmer, on RedHat CEO Matt Szulik
-----------------------------------------------------------------------