Posted to users@spamassassin.apache.org by fr...@ofb.net on 2015/10/18 06:35:56 UTC
newbie questions: sought, sa-learn, rule weights
Hello Users!
Apologies for asking multiple questions, I've just been reading
https://wiki.apache.org/spamassassin/ and have some things I wanted to
ask.
I'm getting a lot of spam, perhaps 25 messages/day, and about half of
it gets through Spamassassin. I'm trying to figure out how to fix the
situation.
I tried using the "sought" ruleset following instructions from
http://taint.org/2007/08/15/004348a.html, but didn't see much
difference.
I'm concerned that the BAYES_* rules aren't showing up in my spam
headers, and would like to know if there's a good way to look at the
tokens in the database. When I do "sa-learn --dump data", I see a file
with lines like this:
0.987 1 0 1436496897 0315e1da7f
0.016 0 1 1410284743 0320ba06ef
0.987 1 0 1393199297 0329ec4e6e
0.003 0 5 1268403253 03541effbc
0.008 0 2 1398222936 038d6e997d
0.016 0 1 1429567309 041cabf4ef
0.016 0 1 1431638107 041d441c1b
Is that normal? How do I get at the actual tokens? How do I see how it
scores a test message, just the Bayesian part? I find that I get a lot
of spam with exactly the same lines in the body of the message, and
the Bayesian classifier doesn't seem to register it.
Here's the output of sa-learn --dump magic:
0.000 0 3 0 non-token data: bayes db version
0.000 0 15466 0 non-token data: nspam
0.000 0 30317 0 non-token data: nham
0.000 0 1733267 0 non-token data: ntokens
0.000 0 1098575745 0 non-token data: oldest atime
0.000 0 1441160002 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 1441160455 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
I couldn't find a sample output on your Wiki, with which to compare
this; I'm worried about the 0.000 lines and other zeroes.
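For comparison purposes, the "magic" rows can be read mechanically. The following is a minimal Python sketch, not part of SpamAssassin, that assumes the five-column layout shown above and that each "non-token data" row carries its value in the third column (as the nspam and nham rows suggest). The helper names `parse_magic` and `as_utc` are hypothetical, and treating the atime values as Unix epoch seconds is an assumption.

```python
from datetime import datetime, timezone

# Sample copied from the `sa-learn --dump magic` output quoted above.
SAMPLE = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 15466 0 non-token data: nspam
0.000 0 30317 0 non-token data: nham
0.000 0 1733267 0 non-token data: ntokens
0.000 0 1098575745 0 non-token data: oldest atime
0.000 0 1441160002 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 1441160455 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
"""

MARKER = "non-token data:"


def parse_magic(text):
    """Return {label: value} for every 'non-token data' row.

    Assumption: the value is the third whitespace-separated field.
    """
    values = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith(MARKER):
            label = fields[4][len(MARKER):].strip()
            values[label] = int(fields[2])
    return values


def as_utc(epoch_seconds):
    """Interpret an atime as Unix epoch seconds (assumed) in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


magic = parse_magic(SAMPLE)
for label, value in magic.items():
    # Only non-zero rows whose label ends in "atime" are shown as dates.
    if label.endswith("atime") and value:
        print(f"{label}: {as_utc(value).isoformat()}")
    else:
        print(f"{label}: {value}")
```

Read this way, the counts that matter for a quick sanity check are nspam, nham and ntokens; the remaining columns on these rows are simply zero in the output above.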
I'm also thinking that I should employ some kind of sender address
whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
receiving for the first time from a particular sender, and there are a
lot of strings that I can say for sure I'd never find in a Subject
line of a message from a friend who is emailing me for the first time:
"ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself, is
there a way to get Spamassassin to notice when this comes in and
automatically whitelist the recipients for me?
Relatedly, if I create rules for e.g. ATTN, "stock tip", then I'd also
like to generate my own rule weights using my own spam/ham corpora.
Does it still take a week to do? Why did Spamassassin go back to using
a GA for this process? Aren't there some much faster algorithms
around?
Thank you in advance,
Frederick
Re: newbie questions: sought, sa-learn, rule weights
Posted by fr...@ofb.net.
Hi Reindl,
Thanks for your reply.
I replied separately to John about my Bayes setup - you were right,
wrong user.
Thanks for the advice about whitelisting being unnecessary. I hope
that getting the Bayesian part working will make my setup effective
without this.
Thanks,
Frederick
On Sun, Oct 18, 2015 at 11:29:16AM +0200, Reindl Harald wrote:
>
>
> Am 18.10.2015 um 06:35 schrieb frederik@ofb.net:
> >I'm concerned that the BAYES_* rules aren't showing up in my spam
> >headers
>
> You are almost certainly training a different Bayes database than the
> one belonging to the user SA runs as.
>
> >and would like to know if there's a good way to look at the
> >tokens in the database
>
> There is no way at all; the database stores only truncated hashes of the tokens.
>
> >When I do "sa-learn --dump data", I see a file
> >with lines like this:
> >
> >0.987 1 0 1436496897 0315e1da7f
> >0.016 0 1 1410284743 0320ba06ef
> >0.987 1 0 1393199297 0329ec4e6e
> >0.003 0 5 1268403253 03541effbc
> >0.008 0 2 1398222936 038d6e997d
> >0.016 0 1 1429567309 041cabf4ef
> >0.016 0 1 1431638107 041d441c1b
> >
> >Is that normal?
>
> yes
>
> >How do I get at the actual tokens?
>
> you don't
>
> >How do I see how it scores a test message, just the Bayesian part?
>
> You see the BAYES_00 to BAYES_999 rule hits in the mail headers.
>
> >I find that I get a lot
> >of spam with exactly the same lines in the body of the message, and
> >the Bayesian classifier doesn't seem to register it.
>
> As said above: you are training the wrong Bayes database.
>
> >Here's the output of sa-learn --dump magic:
> >
> >0.000 0 3 0 non-token data: bayes db version
> >0.000 0 15466 0 non-token data: nspam
> >0.000 0 30317 0 non-token data: nham
> >0.000 0 1733267 0 non-token data: ntokens
> >0.000 0 1098575745 0 non-token data: oldest atime
> >0.000 0 1441160002 0 non-token data: newest atime
> >0.000 0 0 0 non-token data: last journal sync atime
> >0.000 0 1441160455 0 non-token data: last expiry atime
> >0.000 0 0 0 non-token data: last expire atime delta
> >0.000 0 0 0 non-token data: last expire reduction count
>
> FROM WHAT USER?
>
> >I couldn't find a sample output on your Wiki, with which to compare
> >this; I'm worried about the 0.000 lines and other zeroes.
>
> they are normal
>
> >I'm also thinking that I should employ some kind of sender address
> >whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
> >receiving for the first time from a particular sender, and there are a
> >lot of strings that I can say for sure I'd never find in a Subject
> >line of a message from a friend who is emailing me for the first time:
> >"ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself, is
> >there a way to get Spamassassin to notice when this comes in and
> >automatically whitelist the recipients for me?
>
> There is no need to do so, and you certainly don't want it done
> automatically; you only *think* you want it. Blind whitelisting is easy
> to trick with forged senders; whitelist_auth is based on DKIM/SPF presence.
>
> Typically you don't need much whitelisting unless you are a hosting
> provider and care about your load (combining whitelist_auth and shortcircuit).
>
Re: newbie questions: sought, sa-learn, rule weights
Posted by Reindl Harald <h....@thelounge.net>.
Am 18.10.2015 um 06:35 schrieb frederik@ofb.net:
> I'm concerned that the BAYES_* rules aren't showing up in my spam
> headers
You are almost certainly training a different Bayes database than the
one belonging to the user SA runs as.
> and would like to know if there's a good way to look at the
> tokens in the database
There is no way at all; the database stores only truncated hashes of the tokens.
> When I do "sa-learn --dump data", I see a file
> with lines like this:
>
> 0.987 1 0 1436496897 0315e1da7f
> 0.016 0 1 1410284743 0320ba06ef
> 0.987 1 0 1393199297 0329ec4e6e
> 0.003 0 5 1268403253 03541effbc
> 0.008 0 2 1398222936 038d6e997d
> 0.016 0 1 1429567309 041cabf4ef
> 0.016 0 1 1431638107 041d441c1b
>
> Is that normal?
yes
> How do I get at the actual tokens?
you don't
> How do I see how it scores a test message, just the Bayesian part?
You see the BAYES_00 to BAYES_999 rule hits in the mail headers.
> I find that I get a lot
> of spam with exactly the same lines in the body of the message, and
> the Bayesian classifier doesn't seem to register it.
As said above: you are training the wrong Bayes database.
> Here's the output of sa-learn --dump magic:
>
> 0.000 0 3 0 non-token data: bayes db version
> 0.000 0 15466 0 non-token data: nspam
> 0.000 0 30317 0 non-token data: nham
> 0.000 0 1733267 0 non-token data: ntokens
> 0.000 0 1098575745 0 non-token data: oldest atime
> 0.000 0 1441160002 0 non-token data: newest atime
> 0.000 0 0 0 non-token data: last journal sync atime
> 0.000 0 1441160455 0 non-token data: last expiry atime
> 0.000 0 0 0 non-token data: last expire atime delta
> 0.000 0 0 0 non-token data: last expire reduction count
FROM WHAT USER?
> I couldn't find a sample output on your Wiki, with which to compare
> this; I'm worried about the 0.000 lines and other zeroes.
they are normal
> I'm also thinking that I should employ some kind of sender address
> whitelisting using e.g. TxRep. Most of my spam is stuff that I'm
> receiving for the first time from a particular sender, and there are a
> lot of strings that I can say for sure I'd never find in a Subject
> line of a message from a friend who is emailing me for the first time:
> "ATTN", "stock tip"... All of the mail I send is Bcc'ed to myself, is
> there a way to get Spamassassin to notice when this comes in and
> automatically whitelist the recipients for me?
There is no need to do so, and you certainly don't want it done
automatically; you only *think* you want it. Blind whitelisting is easy
to trick with forged senders; whitelist_auth is based on DKIM/SPF presence.
Typically you don't need much whitelisting unless you are a hosting
provider and care about your load (combining whitelist_auth and shortcircuit).
Re: newbie questions: sought, sa-learn, rule weights
Posted by RW <rw...@googlemail.com>.
On Mon, 19 Oct 2015 10:57:43 -0700
frederik@ofb.net wrote:
> I guess I need to use "spamc -L" rather than "sa-learn"? I tried
> "spamc -L" but it seems rather slow, about two messages per second,
> only slightly faster when the messages have already been seen. Is
> "sa-learn" faster than "spamc -L"? It seems to do closer to 8 messages
> per second, although they were all "seen" messages.
>
> Perhaps I should just run spamd as my user, rather than user spamd?
> It's a single-user system... Or if it would be easier to point the
> global spamd to ~/.spamassassin/, but that seems messy...
Run sa-learn as the user spamd, using su.
Re: newbie questions: sought, sa-learn, rule weights
Posted by fr...@ofb.net.
Hi John,
Thanks for your reply. Too bad the 'sought' rules are not working
anymore.
> You have plenty of tokens, so it's likely you're training Bayes as a
> different user than SA is running under, and you don't have a site-wide
> user-independent Bayes configured.
Yeah, looks like you're correct, I'm running spamd as root, with a
Systemd unit:
ExecStart=/usr/bin/vendor_perl/spamd --allow-tell -x -u spamd -g spamd
I guess I need to use "spamc -L" rather than "sa-learn"? I tried
"spamc -L" but it seems rather slow, about two messages per second,
only slightly faster when the messages have already been seen. Is
"sa-learn" faster than "spamc -L"? It seems to do closer to 8 messages
per second, although they were all "seen" messages.
Perhaps I should just run spamd as my user, rather than user spamd?
It's a single-user system... Or if it would be easier to point the
global spamd to ~/.spamassassin/, but that seems messy...
> Care to post the rules hits for some of the FNs? That should be in their
> headers. That might let us provide more specific advice, for instance: are
> you hitting URIBL_BLOCKED?
Advice would be great...
From: info@freedesign.be
Subject: Sta niet in je blootje
X-Spam-Status: No, score=4.1 required=5.0 tests=DC_IMAGE_SPAM_TEXT,DKIM_SIGNED,
    DKIM_VALID,DKIM_VALID_AU,HTML_IMAGE_RATIO_02,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,
    SPF_PASS,T_SPF_HELO_TEMPERROR,UNWANTED_LANGUAGE_BODY autolearn=no
    autolearn_force=no version=3.4.1
From: "John L. Cheeseman" <jo...@taiwantee.com>
Subject: How making $1250 a day is Possible
X-Spam-Status: No, score=1.7 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,
    T_SPF_HELO_TEMPERROR,URIBL_BLACK autolearn=no autolearn_force=no
    version=3.4.1
From: "Bruce A. Stone" <br...@acnefixes.com>
Subject: This Could Change Everything in Bed
X-Spam-Status: No, score=1.0 required=5.0 tests=BODY_URI_ONLY,
    T_RP_MATCHES_RCVD,T_SPF_HELO_TEMPERROR,T_SPF_TEMPERROR autolearn=no
    autolearn_force=no version=3.4.1
From: "Anne M. Henderson" <an...@theblogcloud.com>
Subject: Want to Drop 3 Dress Sizes This Week?
X-Spam-Status: No, score=2.5 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,
    T_SPF_HELO_TEMPERROR,URIBL_DBL_SPAM autolearn=no autolearn_force=no
    version=3.4.1
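Checking headers like these for a Bayes hit can be automated. Below is a minimal Python sketch, assuming the X-Spam-Status layout shown above (a `score=` field, then a comma-separated `tests=` list that may be folded across lines, followed by further `key=value` fields). The function name `parse_spam_status` is hypothetical and this is not a SpamAssassin tool.

```python
import re

# One of the headers quoted above, with folded continuation lines.
SAMPLE_HEADER = """\
X-Spam-Status: No, score=1.7 required=5.0 tests=SPF_PASS,T_RP_MATCHES_RCVD,
 T_SPF_HELO_TEMPERROR,URIBL_BLACK autolearn=no autolearn_force=no
 version=3.4.1
"""


def parse_spam_status(header):
    """Extract the score, the list of rule hits, and a BAYES_* flag."""
    # Unfold the header: collapse line breaks and runs of whitespace.
    flat = " ".join(header.split())
    score = float(re.search(r"score=(-?\d+(?:\.\d+)?)", flat).group(1))
    # The tests list runs until the next "key=" field or the end of the header.
    tests_field = re.search(r"tests=(.*?)(?:\s+\w+=|$)", flat).group(1)
    tests = [name for name in re.split(r"[,\s]+", tests_field) if name]
    return {
        "score": score,
        "tests": tests,
        "has_bayes": any(name.startswith("BAYES_") for name in tests),
    }


result = parse_spam_status(SAMPLE_HEADER)
print(result["score"], result["tests"])
print("Bayes rule hit:", result["has_bayes"])
```

None of the four headers above contains a BAYES_* rule, which is the symptom discussed in this thread.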
Let me know if any of the stock tips work out :)
Thanks,
Frederick
On Sun, Oct 18, 2015 at 12:29:19PM -0700, John Hardin wrote:
> On Sat, 17 Oct 2015, frederik@ofb.net wrote:
>
> >I'm getting a lot of spam, perhaps 25 messages/day, and about half of
> >it gets through Spamassassin. I'm trying to figure out how to fix the
> >situation.
>
> Care to post the rules hits for some of the FNs? That should be in their
> headers. That might let us provide more specific advice, for instance: are
> you hitting URIBL_BLOCKED?
>
> >I tried using the "sought" ruleset following instructions from
> >http://taint.org/2007/08/15/004348a.html, but didn't see much
> >difference.
>
> Sadly that's gone stale and may not help much with current spam. The last
> time I saw an update was March 2014.
>
> >I'm concerned that the BAYES_* rules aren't showing up in my spam
> >headers
>
> The two most common causes for that are insufficient tokens learned and
> learning under the wrong user.
>
> >Here's the output of sa-learn --dump magic:
> >
> >0.000 0 15466 0 non-token data: nspam
> >0.000 0 30317 0 non-token data: nham
>
> You have plenty of tokens, so it's likely you're training Bayes as a
> different user than SA is running under, and you don't have a site-wide
> user-independent Bayes configured.
>
> >Relatedly, if I create rules for e.g. ATTN, "stock tip",
>
> Funny you should mention that particular one. I just noticed it had popped
> up to the top of the masscheck corpora hits, and I've pushed a scored rule
> for it. Hopefully that will start getting points tomorrow.
>
>
> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> I'll have that son of a bitch eating out of dumpsters in less than
> two years. -- MS CEO Steve Ballmer, on RedHat CEO Matt Szulik
> -----------------------------------------------------------------------
>
Re: newbie questions: sought, sa-learn, rule weights
Posted by John Hardin <jh...@impsec.org>.
On Sat, 17 Oct 2015, frederik@ofb.net wrote:
> I'm getting a lot of spam, perhaps 25 messages/day, and about half of
> it gets through Spamassassin. I'm trying to figure out how to fix the
> situation.
Care to post the rules hits for some of the FNs? That should be in their
headers. That might let us provide more specific advice, for instance: are
you hitting URIBL_BLOCKED?
> I tried using the "sought" ruleset following instructions from
> http://taint.org/2007/08/15/004348a.html, but didn't see much
> difference.
Sadly that's gone stale and may not help much with current spam. The last
time I saw an update was March 2014.
> I'm concerned that the BAYES_* rules aren't showing up in my spam
> headers
The two most common causes for that are insufficient tokens learned and
learning under the wrong user.
> Here's the output of sa-learn --dump magic:
>
> 0.000 0 15466 0 non-token data: nspam
> 0.000 0 30317 0 non-token data: nham
You have plenty of tokens, so it's likely you're training Bayes as a
different user than SA is running under, and you don't have a site-wide
user-independent Bayes configured.
> Relatedly, if I create rules for e.g. ATTN, "stock tip",
Funny you should mention that particular one. I just noticed it had popped
up to the top of the masscheck corpora hits, and I've pushed a scored
rule for it. Hopefully that will start getting points tomorrow.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
I'll have that son of a bitch eating out of dumpsters in less than
two years. -- MS CEO Steve Ballmer, on RedHat CEO Matt Szulik
-----------------------------------------------------------------------