Posted to users@spamassassin.apache.org by Stan Hoeppner <st...@hardwarefreak.com> on 2013/10/12 18:26:26 UTC

FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

These two rules are adding 4.0 pts to most email I receive from
debian-user that originates from Gmane, putting quite a bit of legit
Debian list mail into my spam folder.  With these two rules adding 4
pts, a tiny hit from any other rule puts it over the top.

FSL_HELO_BARE_IP_2 needs to have a -much- lower score, or be eliminated
entirely, as it overlaps with at least 3 other tests, as pointed out
previously by another user.  If a message makes it through Gmane, and
Debian, and then gets flagged by my "stock" rules introduced through an
auto update, then something is obviously wrong.



Example:


Content analysis details:   (4.8 points, 4.2 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
 1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5314]
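
The arithmetic behind this report can be checked with a minimal sketch. The rule scores are copied from the report above; the names `total_score` and `is_spam` are illustrative and are not part of SpamAssassin, and the comparison assumes a message is tagged once its total reaches the required score.

```python
# Rule scores copied from the content analysis report above.
hits = {
    "FSL_HELO_BARE_IP_2": 2.8,
    "RCVD_NUMERIC_HELO": 1.2,
    "BAYES_50": 0.8,
}


def total_score(rule_hits):
    """Sum the per-rule scores, rounded to one decimal like the report."""
    return round(sum(rule_hits.values()), 1)


def is_spam(rule_hits, required):
    """Assumed tagging logic: spam when the total reaches the threshold."""
    return total_score(rule_hits) >= required


if __name__ == "__main__":
    print(total_score(hits))            # total for this message
    print(is_spam(hits, required=4.2))  # the locally lowered threshold
    print(is_spam(hits, required=5.0))  # the threshold the stock scores target
```

The two HELO rules contribute 4.0 of the total, so the neutral Bayes result alone decides the outcome at a 4.2 threshold.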




Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
	by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
	for <st...@hardwarefreak.com>; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
Received: from localhost (localhost [127.0.0.1])
	by bendel.debian.org (Postfix) with QMQP
	id 6EDAF1AD; Sat, 12 Oct 2013 15:23:26 +0000 (UTC)
Old-Return-Path: <gl...@m.gmane.org>
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
X-Spam-Level:
X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
	LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
	T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
X-Original-To: lists-debian-user@bendel.debian.org
Delivered-To: lists-debian-user@bendel.debian.org
Received: from localhost (localhost [127.0.0.1])
	by bendel.debian.org (Postfix) with ESMTP id E553017F
	for <li...@bendel.debian.org>; Sat, 12 Oct 2013 15:23:16 +0000 (UTC)
X-Virus-Scanned: at lists.debian.org with policy bank en-ht
X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-10000 required=5.3
	tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
	T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham
Received: from bendel.debian.org ([127.0.0.1])
	by localhost (lists.debian.org [127.0.0.1]) (amavisd-new, port 2525)
	with ESMTP id pt3Xy4CFnT1w for <li...@bendel.debian.org>;
	Sat, 12 Oct 2013 15:23:09 +0000 (UTC)
X-policyd-weight: using cached result; rate:hard: -6.1
Received: from plane.gmane.org (plane.gmane.org [80.91.229.3])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by bendel.debian.org (Postfix) with ESMTPS id 4754D14C
	for <de...@lists.debian.org>; Sat, 12 Oct 2013 15:23:09 +0000 (UTC)
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <gl...@m.gmane.org>)
	id 1VV122-0007gh-Sa
	for debian-user@lists.debian.org; Sat, 12 Oct 2013 17:23:06 +0200
Received: from 189.151.225.189 ([189.151.225.189])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Sat, 12 Oct 2013 17:23:06 +0200
Received: from hvw59601 by 189.151.225.189 with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Sat, 12 Oct 2013 17:23:06 +0200
X-Injected-Via-Gmane: http://gmane.org/
To: debian-user@lists.debian.org
From: Hugo Vanwoerkom <hv...@care2.com>
Subject: Re: linux-image-3.10-3-amd64 unbootable: /dev/disk/by-uuid not created
Date: Sat, 12 Oct 2013 10:22:56 -0500
Lines: 30
Message-ID: <l3...@ger.gmane.org>
References: <52...@opendreams.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: 189.151.225.189
User-Agent: Thunderbird 2.0.0.24 (X11/20101027)
In-Reply-To: <52...@opendreams.net>
X-Rc-Virus: 2007-09-13_01
X-Rc-Spam: 2008-11-04_01
Resent-Message-ID: <kb...@bendel>
Resent-From: debian-user@lists.debian.org
X-Mailing-List: <de...@lists.debian.org> archive/latest/658107
X-Loop: debian-user@lists.debian.org
List-Id: <debian-user.lists.debian.org>
List-Post: <ma...@lists.debian.org>
List-Help: <mailto:debian-user-request@lists.debian.org?subject=help>
List-Subscribe: <mailto:debian-user-request@lists.debian.org?subject=subscribe>
List-Unsubscribe: <mailto:debian-user-request@lists.debian.org?subject=unsubscribe>
Precedence: list
Resent-Sender: debian-user-request@lists.debian.org
Resent-Date: Sat, 12 Oct 2013 15:23:26 +0000 (UTC)

Jesse Molina wrote:
>
> Hi
>
> I have a Debian unstable host which successfully boots from the
> linux-image-3.10-1-amd64 kernel package.  However, I recently installed
> the linux-image-3.10-3-amd64 kernel package, and it is unbootable.
>
> When I boot from the linux-image-3.10-3-amd64 package kernel, the boot
> fails and drops me into the initramfs busybox.  The message "Gave up
> waiting for the root device." appears, along with "ALERT!
> /dev/disk/by-uuid/bla-bla-bla-my-id-here does not exist.".
>
> The problem appears to be that udev is not creating /dev/disk/by-uuid/*
> and similar objects.  The only directory being created in /dev/disk is
> "by-id".  Note that the mdadm arrays are being successfully assembled
> and I can see them if I cat /proc/mdstat.
>
> the root= argument in grub is a UUID of a mdadm RAID1 array.  This
> host's boot part is a RAID1, and the root part is a RAID5.  This is
> standard PC desktop hardware with four disk drives upon which the md
> RAIDs are built.
>
> The host has been dist-upgraded as of this time.
>
> Advice appreciated.  Otherwise, I'll file a bug on it.

try 'rootdelay=5' (w/o quotes) as bootparam.

Hugo


-- To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org with a
subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/l3bpg7$gra$1@ger.gmane.org

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by SM <sm...@resistor.net>.
At 02:56 15-10-2013, Stan Hoeppner wrote:
>In both cases the last two Received: headers in each message are
>forgeries as no SMTP transaction occurred.  I'm sure this violates more
>than one SMTP RFC, but I doubt Gmane will change the way they do this
>any time soon.

I don't think that there is any violation of the specification.

Regards,
-sm 


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Thu, 17 Oct 2013, Stan Hoeppner wrote:

> Whether Gmane is violating RFC or not isn't my concern.  What is my
> concern is that the way they create these headers is breaking the two
> rules in the subject line.  Apparently a fix is already in place to
> prevent these two rules from being applied to list mail.

I only adjusted FSL_HELO_BARE_IP_2, I didn't take a look at 
RCVD_NUMERIC_HELO.

> Now that the exact nature of the problem is known, maybe the test can be 
> modified to work with some list mail, but not this particular 'flavor'.

That's a possibility. Having Received: exceptions for certain cases may be 
justified, but that has to be weighed against the possibility of a spammer 
forging a Received: header that allows the message to bypass an effective 
rule.
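
The trade-off described here can be illustrated with a minimal sketch. It is not the actual SpamAssassin rule: the regular expression, the choice of `List-Id` and `Precedence` as list markers, and all function names are assumptions made for illustration.

```python
import re
from email import message_from_string

# Hypothetical pattern: a Received header whose "from" name is a bare IPv4
# address, e.g. "from 189.151.225.189 ([189.151.225.189])".
BARE_IP_HELO = re.compile(r"^from\s+\d{1,3}(?:\.\d{1,3}){3}\s")


def has_bare_ip_helo(msg):
    """True if any Received header starts with a bare-IP 'from' name."""
    return any(BARE_IP_HELO.match(h.strip())
               for h in msg.get_all("Received", []))


def looks_like_list_mail(msg):
    """Assumed list markers. Any sender can add these headers."""
    precedence = str(msg.get("Precedence", "")).lower()
    return msg.get("List-Id") is not None or precedence == "list"


def rule_fires(msg, exclude_lists=True):
    """Apply the bare-IP check, optionally skipping list mail."""
    if exclude_lists and looks_like_list_mail(msg):
        return False
    return has_bare_ip_helo(msg)


if __name__ == "__main__":
    raw = ("Received: from 189.151.225.189 ([189.151.225.189])\n"
           "\tby main.gmane.org with esmtp\n"
           "List-Id: <debian-user.lists.debian.org>\n"
           "\n"
           "body\n")
    msg = message_from_string(raw)
    print(rule_fires(msg))                       # exclusion suppresses the hit
    print(rule_fires(msg, exclude_lists=False))  # the check itself still matches
```

Because the exclusion keys on headers the sender controls, a spammer who adds a `List-Id` header would bypass the check in the same way, which is the risk being weighed.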

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   If the rock of doom requires a gentle nudge away from Gaia to
   prevent a very bad day for Earthlings, NASA won’t be riding to the
   rescue. These days, NASA does dodgy weather research and outreach
   programs, not stuff in actual space with rockets piloted by
   flinty-eyed men called Buzz.                       -- Daily Bayonet
-----------------------------------------------------------------------
  504 days since the first successful private support mission to ISS (SpaceX)

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/17/2013 2:09 PM, Jonas Eckerman wrote:
> I answer privately since this really isn't about SpamAssassin any more, and SpamAssassin isn't about RFC conformance.

Oh, but it does directly relate to the above two rules.  And I believe
this is a healthy discussion.  It will educate others as to exactly why
these rules break when analyzing some list mail, lists which are the
backbone of the FLOSS community.  Before now the maintainers were making
decisions about applying these rules to list mail based solely on
statistics.  This discussion has fleshed out the 'why'.  The resulting
knowledge may be helpful in writing better, more intelligent rules in
the future.  So I hope it's ok with you that I'm copying this back to
the list.

>>> In what way are they forged?
> 
>> Do you honestly not get it, or are you being a troll or pedant?
> 
> I honestly didn't get it, and I'm not a troll. Whether I'm a pedant or not would be a matter of opinion (where you're free to hold one different from mine).

I hope I didn't sound accusatory here, Jonas.  I've been around the
'interwebs' long enough to know that now and then people try to suck you
into an argument over minutiae for the sake of arguing.  I thought there
was a chance that was the case here, so I simply asked.

>> Or, if you simply re-read my msg maybe it will become clear.
> 
> I've re-read the message. I can see that I might have quoted badly (I read and replied with my mobile phone, which I now think was a bad idea), and I apologize for that.

No problem.

> It still seems that you're saying the received headers are forged when they were inserted for transfers not involving SMTP though, even if you also pointed to other errors in the headers in question.

Only the one that claims to be "esmtp" is forged.  That's the 2nd
header.  The first is not, as it clearly states "local" injection.
That's the PHP end of it.

>> To create a record apparently in case of abuse, Gmane in particular injects
>> the rDNS string of the HTTP client machine into the EHLO position of a
>> Received: header, using the bare IP upon NXDOMAIN or SERVFAIL.
> 
> I see two problems with calling that a forgery:

Yeah, scratch that.  I painted with an overly broad brush in my first
post.  Only the 2nd header is the problem, the first is fine.

> 1: The idea of an EHLO position is only relevant for protocols with an EHLO command. When the message is received using HTTP, NNTP, UUCP, local pipes, etc, there is no EHLO command.

Of course.

> 2: The Received headers have not always been as strictly defined as one might wish. 

Absolutely agreed.  Not all MTAs use the same Received: header format.
But I assume this was taken into account in these two tests.  I've not
actually looked at the regexes yet.

> Even now putting in the EHLO parameter is a SHOULD, not a MUST, for SMTP.

Thankfully most serious MTA writers treat these directives the same.

> On to the first set of received headers, I'm commenting on the first (in insertion order) two headers.
> 
>> Received: from stan by mo-65-41-216-221.sta.embarqhsd.net with local (Gmexim 0.1 (Debian))
>>  id 1AlnuQ-0007hv-00
>>  for <de...@lists.debian.org>; Tue, 15 Oct 2013 09:40:02 +0200
> 
> Was mo-65-41-216-221.sta.embarqhsd.net your RDNS when posting that? If it was I agree that this header is a forgery. If, OTOH, mo-65-41-216-221.sta.embarqhsd.net really was the machine receiving a locally submitted message (possibly from a PHP script) from "stan" (which I'd guess is a user name since that is common to put in that position for locally submitted messages), it seems just fine.

Yes.  This is my PC's FCrDNS.  This injection stamping is warranted, and
wanted by receivers, and is clearly stated as 'local' injection.  This
is a good thing, absolutely.

>> Received: from mo-65-41-216-221.sta.embarqhsd.net ([65.41.216.221])
>>  by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
>>  id 1AlnuQ-0007hv-00
>>  for <de...@lists.debian.org>; Tue, 15 Oct 2013 09:40:02 +0200
> 
> If the previous header was correctly listing its own address in the "by" clause, this seems just fine. If the address, mo-65-41-216-221.sta.embarqhsd.net, was actually your address, this header is of course incorrect as well.

Yes, this is the one that is forged.  There was no ESMTP transaction
between 65.41.216.221, my PC, and main.gmane.org, as this header above
clearly states there was.  Maybe "fabricated" or "manufactured" is a
better word.  They're not pretending it to be something it is not, but
merely creating it out of thin air.

Whether Gmane is violating RFC or not isn't my concern.  What is my
concern is that the way they create these headers is breaking the two
rules in the subject line.  Apparently a fix is already in place to
prevent these two rules from being applied to list mail.  Now that the
exact nature of the problem is known, maybe the test can be modified to
work with some list mail, but not this particular 'flavor'.

> If mo-65-41-216-221.sta.embarqhsd.net was your address, a combination of the two incorrect headers could become one correct header...

Getting Gmane to change their headers isn't the goal here.  Changing the
SA tests to deal with it, is, I think, a reasonable goal.

> Maybe I got it right now, maybe not.

I think you were close. :)

-- 
Stan


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/16/2013 3:01 AM, Jonas Eckerman wrote:
>> Operators of newsgroups which mirror/archive mailing
>> lists, and allow posting from a web interface, are adding forged
>> Received: headers before sending an email to the respective list
>> server.
> 
> In what way are they forged?

I'm new to this list.  Before I waste my time replying to what seems to be
trolling, or pedantry, I'd like to know.  I laid out the case with all
the evidence, half of which you cut in your reply.

Do you honestly not get it, or are you being a troll or pedant?  If the
former, I'll politely attempt to explain it so you can understand it.
Or, if you simply re-read my msg maybe it will become clear.

Regards.

-- 
Stan



Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Jonas Eckerman <jo...@truls.org>.
>Operators of newsgroups which mirror/archive mailing
>lists, and allow posting from a web interface, are adding forged
>Received: headers before sending an email to the respective list
>server.

In what way are they forged? Do they contain addresses that don't match the system adding the received-line or the system it received the message from? 

>In both cases the last two Received: headers in each message are
>forgeries as no SMTP transaction occurred.

Do those headers say that an SMTP transaction occurred? If they don't, what is forged? 

I'm not sure whether you mean "last in insertion order" or "last in reading order" so I'll answer for both. :-) 

Insertion order:

>Received: from list by plane.gmane.org with local (Exim 4.69)
>	(envelope-from <gl...@m.gmane.org>)
>	id 1VVzEY-0005lJ-P1
>	for debian-user@lists.debian.org; Tue, 15 Oct 2013 09:40:02 +0200

This one says it was received locally without using SMTP. This is normal when a message is sent/queued by a local application.

>Received: from plane.gmane.org (plane.gmane.org [80.91.229.3])
>	(using TLSv1 with cipher AES256-SHA (256/256 bits))
>	(Client did not present a certificate)
>	by bendel.debian.org (Postfix) with ESMTPS id 7DD8CA6
>	for <de...@lists.debian.org>; Tue, 15 Oct 2013 07:40:05 +0000 (UTC)

This one says that the message was received with ESMTP. Do you know that it wasn't? 

Reading order:

>Received: from 94.79.44.98 ([94.79.44.98])
>        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
>        id 1AlnuQ-0007hv-00
>    for <de...@lists.debian.org>; Sun, 13 Oct 2013 19:40:43 +0200

This one says it was received with ESMTP. Again, do you know it wasn't? 

>Received: from freehck by 94.79.44.98 with local (Gmexim 0.1 (Debian))
>        id 1AlnuQ-0007hv-00
>    for <de...@lists.debian.org>; Sun, 13 Oct 2013 19:40:43 +0200

This one says it was received locally without SMTP. This is perfectly normal if it was received from a local application, for example a web server running a PHP script or a gateway fetching messages from something else. 

>I'm sure this violates more
>than one SMTP RFC, but I doubt Gmane will change the way they do this
>any time soon.

I don't think it does. Trace headers are useful for mail regardless of the protocol used for the transfers between systems/applications, and are defined in the Internet Mail Format RFCs (822 descendants,  not sure what the current one is but if you start at 2822 you should be able to find it). 

(Also, do the SMTP RFCs really apply when you're not using SMTP?) 

Regards
/jonas 
-- 
 Monypholite gemgas. 

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/12/2013 9:28 PM, John Hardin wrote:
> On Sat, 12 Oct 2013, Stan Hoeppner wrote:
> 
>> Steve, the one who wrote this regex, would you please explain your
>> reasoning behind giving this rule a score so high as 2.8,
> 
> That score was auto-assigned by masscheck, where it is doing quite well:
> 
> http://ruleqa.spamassassin.org/?rule=FSL_HELO_BARE_IP_2
> 
>> and engage in discussion WRT lowering the score, eliminating the
>> overlap with the other bare IP HELO rules, etc?
> 
> It seems that 94% of the ham hits in masscheck are against list mail,
> and none of the spam hits are, so it would seem reasonable to add an
> exclusion for list messages.

I did some digging and have discovered precisely why FSL_HELO_BARE_IP_2,
RCVD_NUMERIC_HELO, et al falsely hit on much list mail.  It's quite
interesting.  Operators of newsgroups which mirror/archive mailing
lists, and allow posting from a web interface, are adding forged
Received: headers before sending an email to the respective list server.

To create a record apparently in case of abuse, Gmane in particular
injects the rDNS string of the HTTP client machine into the EHLO
position of a Received: header, using the bare IP upon NXDOMAIN or
SERVFAIL.  There is no SMTP transaction between the hosts, only a PHP
form.  I just tested it:

...
Received: from plane.gmane.org (plane.gmane.org [80.91.229.3])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by bendel.debian.org (Postfix) with ESMTPS id 7DD8CA6
	for <de...@lists.debian.org>; Tue, 15 Oct 2013 07:40:05 +0000 (UTC)
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <gl...@m.gmane.org>)
	id 1VVzEY-0005lJ-P1
	for debian-user@lists.debian.org; Tue, 15 Oct 2013 09:40:02 +0200
Received: from mo-65-41-216-221.sta.embarqhsd.net ([65.41.216.221])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Tue, 15 Oct 2013 09:40:02 +0200
Received: from stan by mo-65-41-216-221.sta.embarqhsd.net with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Tue, 15 Oct 2013 09:40:02 +0200
X-Injected-Via-Gmane: http://gmane.org/


An example from my spam folder, host IP returns NXDOMAIN:
...
Received: from plane.gmane.org (plane.gmane.org [80.91.229.3])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by bendel.debian.org (Postfix) with ESMTPS id BAA4A1F1
	for <de...@lists.debian.org>; Sun, 13 Oct 2013 17:40:46 +0000 (UTC)
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <gl...@m.gmane.org>)
	id 1VVPel-0003yo-CD
	for debian-user@lists.debian.org; Sun, 13 Oct 2013 19:40:43 +0200
Received: from 94.79.44.98 ([94.79.44.98])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Sun, 13 Oct 2013 19:40:43 +0200
Received: from freehck by 94.79.44.98 with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <de...@lists.debian.org>; Sun, 13 Oct 2013 19:40:43 +0200
X-Injected-Via-Gmane: http://gmane.org/


In both cases the last two Received: headers in each message are
forgeries as no SMTP transaction occurred.  I'm sure this violates more
than one SMTP RFC, but I doubt Gmane will change the way they do this
any time soon.
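
To make the comparison between the quoted headers concrete, here is a minimal sketch that pulls the "from", "by" and "with" fields out of a Received header. The regular expression assumes the simplified shape "from X ... by Y ... with Z" seen in the examples above and will not cope with every MTA's format; `parse_hop` and `claims_smtp` are illustrative names.

```python
import re

# Assumed, simplified shape of a Received header: "from X ... by Y ... with Z".
HOP = re.compile(r"from\s+(\S+).*?\bby\s+(\S+).*?\bwith\s+(\S+)", re.S)

SMTP_VARIANTS = {"smtp", "esmtp", "esmtps", "esmtpa", "esmtpsa"}


def parse_hop(received):
    """Return the from/by/with fields of one Received header, or None."""
    m = HOP.search(received)
    if m is None:
        return None
    return {"from": m.group(1), "by": m.group(2), "with": m.group(3).lower()}


def claims_smtp(received):
    """True when the header's 'with' clause names an SMTP variant."""
    hop = parse_hop(received)
    return hop is not None and hop["with"] in SMTP_VARIANTS


if __name__ == "__main__":
    local_hop = ("from stan by mo-65-41-216-221.sta.embarqhsd.net with local "
                 "(Gmexim 0.1 (Debian))\n        id 1AlnuQ-0007hv-00")
    esmtp_hop = ("from mo-65-41-216-221.sta.embarqhsd.net ([65.41.216.221])\n"
                 "        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))\n"
                 "        id 1AlnuQ-0007hv-00")
    print(parse_hop(local_hop))
    print(parse_hop(esmtp_hop))
    print(claims_smtp(local_hop), claims_smtp(esmtp_hop))
```

In the quoted examples the "by" host of the local hop is the "from" host of the esmtp hop, and both Gmexim headers carry the id 1AlnuQ-0007hv-00 in every message shown in this thread, regardless of date.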


My spam folder goes back to 09/2012, has 3341 msgs.

$ grep -P "FSL_HELO_BARE_IP_2|RCVD_NUMERIC_HELO" /home/stan/mail/Recent-Spam -c
1188

$ grep "FSL_HELO_BARE_IP_2" /home/stan/mail/Recent-Spam -c
553

$ grep "main.gmane.org" /home/stan/mail/Recent-Spam -c
166

$ grep -B1 "main.gmane.org" /home/stan/mail/Recent-Spam
...
Received: from 209.239.228.34 ([209.239.228.34])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))

$ grep -A53 "FSL_HELO_BARE_IP_2" /home/stan/mail/Recent-Spam | grep "main.gmane.org" -c
106

All 106 are ham.

$ grep -A53 "RCVD_NUMERIC_HELO" /home/stan/mail/Recent-Spam | grep "main.gmane.org" -c
147

All 147 are ham.
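
The `grep -A53` pipeline approximates "same message" by line distance. A per-message count can be sketched as follows, assuming the rule names appear in each message's X-Spam-Status header and that the folder is in mbox format; `count_hits` and `count_in_mbox` are illustrative names.

```python
import mailbox
from email import message_from_string


def count_hits(messages, rule, relay="main.gmane.org"):
    """Count messages that hit `rule` and carry `relay` in a Received header."""
    count = 0
    for msg in messages:
        status = str(msg.get("X-Spam-Status", ""))
        received = " ".join(str(h) for h in msg.get_all("Received", []))
        if rule in status and relay in received:
            count += 1
    return count


def count_in_mbox(path, rule):
    """Run count_hits over an existing mbox file (never creates the file)."""
    return count_hits(mailbox.mbox(path, create=False), rule)


if __name__ == "__main__":
    sample = message_from_string(
        "Received: from 94.79.44.98 ([94.79.44.98])\n"
        "\tby main.gmane.org with esmtp\n"
        "X-Spam-Status: Yes, score=4.8 tests=FSL_HELO_BARE_IP_2,RCVD_NUMERIC_HELO\n"
        "\n"
        "body\n")
    print(count_hits([sample], "FSL_HELO_BARE_IP_2"))
```

Counting per message avoids both missing a relay line that falls outside the 53-line window and picking up one that belongs to the next message.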

Of the 553 that hit FSL_HELO_BARE_IP_2, some of that may be other such
newsgroup injected mail.  I didn't dig that deep as we have plenty of
data already demonstrating that this test shouldn't be applied to list
mail.  In addition, any tests that target broadband/consumer rDNS
patterns at HELO strings should also be excluded from list mail, given
that the HTTP client rDNS is injected as the HELO string by the likes of
Gmane et al, when rDNS exists.  I assume such tests must exist in SA as
they're fantastic for identifying bot spam.

In fact, I've maintained for a few years now a Postfix PCRE table of
~1650 fully qualified expressions that matches rDNS patterns of consumer
ISPs worldwide.  It evaluates during the SMTP session on client or HELO
rDNS string, REJECTs on a dynamic match or PREPENDs on most generic but
static looking rDNS.  It's much much faster than doing such tests in SA,
and uses far far fewer CPU/memory resources.  And it's one of the
reasons I avoided using SA, or any content filter, for many many years.
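
The idea of classifying a client by its rDNS string can be sketched in a few lines. The two patterns below are hypothetical and far cruder than a maintained table of ~1650 expressions; they are not taken from the table described above. The action names mirror the Postfix access actions mentioned (REJECT, PREPEND), with DUNNO for no match.

```python
import re

# Hypothetical patterns for illustration only.
RULES = [
    # Names that advertise dynamic assignment.
    (re.compile(r"(?:^|[.-])(?:dyn|dynamic|dhcp|pool|ppp|dsl)[.-]", re.I), "REJECT"),
    # Generic names that embed the four octets of an IPv4 address.
    (re.compile(r"\d{1,3}[.-]\d{1,3}[.-]\d{1,3}[.-]\d{1,3}"), "PREPEND"),
]


def classify(rdns):
    """Return the action of the first matching pattern, or DUNNO."""
    for pattern, action in RULES:
        if pattern.search(rdns):
            return action
    return "DUNNO"


if __name__ == "__main__":
    for name in ("mo-65-41-216-221.sta.embarqhsd.net",
                 "dhcp-10-0-0-1.pool.example.net",
                 "plane.gmane.org"):
        print(name, classify(name))
```

Applied to a Gmane-injected header, such a pattern would match the posting client's rDNS rather than the host that actually delivered the mail, which is why the thread argues for excluding list mail from tests of this kind.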

-- 
Stan




Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Adam Katz <an...@khopis.com>.
On Sat, 12 Oct 2013, Stan Hoeppner wrote:
>> and engage in discussion WRT lowering the score, eliminating the
>> overlap with the other bare IP HELO rules, etc?

On 10/12/2013 07:28 PM, John Hardin wrote:
> It seems that 94% of the ham hits in masscheck are against list mail,
> and none of the spam hits are, so it would seem reasonable to add an
> exclusion for list messages.
>
> Maddoc hasn't touched these rules since 2009, so I will go ahead and
> add an exclusion for that.

Actually, the overlap issue is quite real.  These two rules
<http://ruleqa.spamassassin.org/?daterev=20131014-r1531815-n&rule=FSL_HELO_BARE_IP_2+RCVD_NUMERIC_HELO&srcpath=&g=Change>
are quite similar:

MSECS  SPAM%    HAM%    S/O    RANK  SCORE  NAME
0      60.7267  0.3533  0.994  0.85  2.00   FSL_HELO_BARE_IP_2
0      56.8567  0.0784  0.999  0.97  0.00   RCVD_NUMERIC_HELO

Details:
<http://ruleqa.spamassassin.org/20131014-r1531815-n/FSL_HELO_BARE_IP_2/detail>
<http://ruleqa.spamassassin.org/20131014-r1531815-n/RCVD_NUMERIC_HELO/detail>

overlap spam: 99% of RCVD_NUMERIC_HELO hits also hit FSL_HELO_BARE_IP_2;
93% of FSL_HELO_BARE_IP_2 hits also hit RCVD_NUMERIC_HELO (ham 100%)
overlap spam: 93% of FSL_HELO_BARE_IP_2 hits also hit RCVD_NUMERIC_HELO;
99% of RCVD_NUMERIC_HELO hits also hit FSL_HELO_BARE_IP_2 (ham 22%)

That's a lot of overlap.  FSL_HELO_BARE_IP_2 may be well served by
excluding RCVD_NUMERIC_HELO.  Given its higher S/O, that might even get
the latter rule a score again (I assume the zero score came from John's
exclusion and a preference towards FSL_HELO_BARE_IP_2).
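
The overlap figures are directional: each one is the share of one rule's hits that the other rule also hits. A minimal sketch of that calculation, using sets of message identifiers; the function name and the toy sets in the demo are illustrative and are not ruleqa data.

```python
def overlap_pct(hits_a, hits_b):
    """Percentage of messages hitting rule A that also hit rule B."""
    if not hits_a:
        return 0.0
    return 100.0 * len(hits_a & hits_b) / len(hits_a)


if __name__ == "__main__":
    # Toy identifiers, for illustration only.
    rule_a = {1, 2, 3, 4}
    rule_b = {2, 3, 4, 5, 6}
    print(overlap_pct(rule_a, rule_b))  # share of A's hits that B also hits
    print(overlap_pct(rule_b, rule_a))  # share of B's hits that A also hits
```

The two directions differ whenever the rules hit different numbers of messages, which is why the report lists a 99% and a 93% figure for the same pair.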


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/13/2013 11:07 AM, John Hardin wrote:
...
> Yes. It will take a day or two to make it through masscheck. And we've
> had corpora starvation issues the last few weeks; if the ham corpus gets
> thin again updates may be delayed.

FSL_HELO_BARE_IP_2

 3am CDT w/score 2.8
11am CDT w/score 2.4

Update was Oct 13 07:19.  So something adjusted the score down already,
but not by much.

...
> Lowering the required score *increases* potential peril, as you have
> seen. 

I was actually referring to administration peril, not ham/spam FPs.

> The scores generated by the masscheck process are designed to
> optimize detection with the required score set to 5 points. If you lower
> the required score you increase the risk of FPs.

I'll put it back up a bit.

> If you're seeing a particular type of spam that isn't quite scoring high
> enough using the stock rules, let the list know. We can work out custom
> rules for you and/or add new rules to the sandboxes for testing against
> the masscheck corpora and possible inclusion in the standard rules.

Will do.  Thanks for being so responsive.

-- 
Stan

P.S.  Thanks for supporting the 2nd amendment.


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 10/13/2013 2:17 PM, John Hardin wrote:
> On Sun, 13 Oct 2013, Kevin A. McGrail wrote:
>
>> On 10/13/2013 12:33 PM, John Hardin wrote:
>>>  On Sun, 13 Oct 2013, John Hardin wrote:
>>>
>>> >  And we've had corpora starvation issues the last few weeks; if the
>>> >  ham corpus gets thin again updates may be delayed.
>>>
>>>  Yeah, we're starved for ham again; I don't know how quickly this
>>>  change will go out, sorry. 
>>
>> Actually, I don't think Darxus' monitoring script is in sync with the 
>> system because we've put out rules for the past 2 months plus every 
>> single night.
>
> What's the ham threshold at? Has it been reduced from 150k?
>
> Last night's net masscheck had 110691 ham, which shouldn't be enough 
> to publish, correct?
>
> Today's is at 126214 so far, and unless more comes in that's not 
> enough to publish either, correct?
>
> Odd, the current corpus quality check reports 127492 total ham.
>
>
> Regardless, I will check tomorrow and see if the changes are in the 
> latest rules update.
>
>

I think the thresholds are the same but for example, here's a snippet of 
the cron for last night:

  HAM: 203085 (150000 required)
SPAM: 369343 (150000 required)

I know like a year ago I tweaked some threshold settings for date ranges 
so maybe there is a discrepancy between the ruleqa and this?

When in doubt, though, the cron output and the atd command output that 
updates the zone to notify for a new rules release has fired every night 
correctly since at least August 9th.  I can't easily check before that 
in my logs.



Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Sun, 13 Oct 2013, Kevin A. McGrail wrote:

> On 10/13/2013 12:33 PM, John Hardin wrote:
>>  On Sun, 13 Oct 2013, John Hardin wrote:
>> 
>> >  And we've had corpora starvation issues the last few weeks; if the ham 
>> >  corpus gets thin again updates may be delayed.
>>
>>  Yeah, we're starved for ham again; I don't know how quickly this change
>>  will go out, sorry. 
>
> Actually, I don't think Darxus' monitoring script is in sync with the system 
> because we've put out rules for the past 2 months plus every single night.

What's the ham threshold at? Has it been reduced from 150k?

Last night's net masscheck had 110691 ham, which shouldn't be enough to 
publish, correct?

Today's is at 126214 so far, and unless more comes in that's not enough to 
publish either, correct?

Odd, the current corpus quality check reports 127492 total ham.


Regardless, I will check tomorrow and see if the changes are in the latest 
rules update.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   For those who are being swayed by Microsoft's whining about the
   GPL, consider how aggressively viral their Shared Source license is:
   If you've *ever* seen *any* MS code covered by the Shared Source
   license, you're infected for life. MS can sue you for Intellectual
   Property misappropriation whenever they like, so you'd better not
   come up with any Innovative Ideas that they want to Embrace...
-----------------------------------------------------------------------
  500 days since the first successful private support mission to ISS (SpaceX)

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 10/13/2013 12:33 PM, John Hardin wrote:
> On Sun, 13 Oct 2013, John Hardin wrote:
>
>> And we've had corpora starvation issues the last few weeks; if the 
>> ham corpus gets thin again updates may be delayed.
>
> Yeah, we're starved for ham again; I don't know how quickly this 
> change will go out, sorry. 
Actually, I don't think Darxus' monitoring script is in sync with the 
system because we've put out rules for the past 2 months plus every 
single night.

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Sun, 13 Oct 2013, John Hardin wrote:

> And we've had corpora starvation issues the last few weeks; if the ham 
> corpus gets thin again updates may be delayed.

Yeah, we're starved for ham again; I don't know how quickly this change 
will go out, sorry.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The third basic rule of firearms safety:
   Keep your booger hook off the bang switch!
-----------------------------------------------------------------------
  500 days since the first successful private support mission to ISS (SpaceX)

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Sun, 13 Oct 2013, Stan Hoeppner wrote:

> On 10/12/2013 9:28 PM, John Hardin wrote:
>> On Sat, 12 Oct 2013, Stan Hoeppner wrote:
>>
>>> Steve, the one who wrote this regex, would you please explain your
>>> reasoning behind giving this rule a score so high as 2.8,
>>
>> That score was auto-assigned by masscheck, where it is doing quite well:
>>
>> http://ruleqa.spamassassin.org/?rule=FSL_HELO_BARE_IP_2
>>
>>> and engage in discussion WRT lowering the score, eliminating the
>>> overlap with the other bare IP HELO rules, etc?
>>
>> It seems that 94% of the ham hits in masscheck are against list mail,
>> and none of the spam hits are, so it would seem reasonable to add an
>> exclusion for list messages.
>
> That seems to be what I'm seeing here.  That exclusion would be nice.
>
>> Maddoc hasn't touched these rules since 2009, so I will go ahead and add
>> an exclusion for that.
>
> Great.  Thank you.  I assume this exclusion will be picked up via the
> daily update script?

Yes. It will take a day or two to make it through masscheck. And we've had 
corpora starvation issues the last few weeks; if the ham corpus gets thin 
again updates may be delayed.

> On 10/12/2013 9:22 PM, John Hardin wrote:
>> On Sat, 12 Oct 2013, Stan Hoeppner wrote:
>>
>>> Content analysis details:   (4.8 points, 4.2 required)
>>
>> Why did you lower the required score?
>
> Frankly, because I am not, and do not wish to become, an SA expert, with
> all the time/effort that entails.  Bringing the required score down
> progressively until I found some "balance" seemed a better strategy,
> less fraught with potential peril than modifying the scores of
> individual stock rules, creating a bunch of custom rules, etc.  Until
> somewhat recently that strategy seemed to be working relatively well.

Lowering the required score *increases* potential peril, as you have seen. 
The scores generated by the masscheck process are designed to optimize 
detection with the required score set to 5 points. If you lower the 
required score you increase the risk of FPs.

If you're seeing a particular type of spam that isn't quite scoring high 
enough using the stock rules, let the list know. We can work out custom 
rules for you and/or add new rules to the sandboxes for testing against 
the masscheck corpora and possible inclusion in the standard rules.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The third basic rule of firearms safety:
   Keep your booger hook off the bang switch!
-----------------------------------------------------------------------
  500 days since the first successful private support mission to ISS (SpaceX)

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/12/2013 9:28 PM, John Hardin wrote:
> On Sat, 12 Oct 2013, Stan Hoeppner wrote:
> 
>> Steve, the one who wrote this regex, would you please explain your
>> reasoning behind giving this rule a score so high as 2.8,
> 
> That score was auto-assigned by masscheck, where it is doing quite well:
> 
> http://ruleqa.spamassassin.org/?rule=FSL_HELO_BARE_IP_2
> 
>> and engage in discussion WRT lowering the score, eliminating the
>> overlap with the other bare IP HELO rules, etc?
> 
> It seems that 94% of the ham hits in masscheck are against list mail,
> and none of the spam hits are, so it would seem reasonable to add an
> exclusion for list messages.

That seems to be what I'm seeing here.  That exclusion would be nice.

> Maddoc hasn't touched these rules since 2009, so I will go ahead and add
> an exclusion for that.

Great.  Thank you.  I assume this exclusion will be picked up via the
daily update script?


On 10/12/2013 9:22 PM, John Hardin wrote:
> On Sat, 12 Oct 2013, Stan Hoeppner wrote:
>
>> Content analysis details:   (4.8 points, 4.2 required)
>
> Why did you lower the required score?

Frankly, because I am not, and do not wish to become, an SA expert, with
all the time/effort that entails.  Bringing the required score down
progressively until I found some "balance" seemed a better strategy,
less fraught with potential peril than modifying the scores of
individual stock rules, creating a bunch of custom rules, etc.  Until
somewhat recently that strategy seemed to be working relatively well.

-- 
Stan




Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Sat, 12 Oct 2013, Stan Hoeppner wrote:

> Steve, the one who wrote this regex, would you please explain your
> reasoning behind giving this rule a score so high as 2.8,

That score was auto-assigned by masscheck, where it is doing quite well:

http://ruleqa.spamassassin.org/?rule=FSL_HELO_BARE_IP_2

> and engage in discussion WRT lowering the score, eliminating the overlap 
> with the other bare IP HELO rules, etc?

It seems that 94% of the ham hits in masscheck are against list mail, and 
none of the spam hits are, so it would seem reasonable to add an exclusion 
for list messages.

Maddoc hasn't touched these rules since 2009, so I will go ahead and add 
an exclusion for that.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   End users want eye candy and the "ooo's and aaaahhh's" experience
   when reading mail. To them email isn't a tool, but an entertainment
   form.                                                 -- Steve Lake
-----------------------------------------------------------------------
  499 days since the first successful private support mission to ISS (SpaceX)

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/12/2013 1:04 PM, Benny Pedersen wrote:
> Stan Hoeppner skrev den 2013-10-12 18:26:
> 
>> FSL_HELO_BARE_IP_2 needs to have a -much- lower score, or be eliminated
>> entirely, as it overlaps with at least 3 other tests, as pointed out
>> previously by another user.  If a message makes it through Gmane, and
>> Debian, and then gets flagged by my "stock" rules introduced through an
>> auto update, then something is obviously wrong.
> 
> The problem is that one hop in the delivery chain used an IP address in
> its HELO; it does not matter that Debian did not. :)
> 
> Ask the Debian mailing list maintainers to file a bug for the missing SPF
> and DKIM, so that recipients can whitelist something from this list?
...

Benny, I see your lack of perception and insight is as profound here as
on Postfix-users.  Please do not reply to my posts, here, there, or
anywhere.


Steve, the one who wrote this regex, would you please explain your
reasoning behind giving this rule a score so high as 2.8, and engage in
discussion WRT lowering the score, eliminating the overlap with the
other bare IP HELO rules, etc?

The introduction of this rule has caused a large amount of list mail to
be falsely tagged, and is not causing any additional spam to be tagged.
 All it's doing here is causing FPs.  FYI, 99.5%+ of my inbound non-spam
mail stream is list mail.  So simply whitelisting the lists, as many
might suggest, is not an option.  I use SA specifically to tag spam
coming from the lists because checks during the SMTP session are
obviously useless here.  Until this new rule it was working relatively
well.  Some spam was getting through, but I wasn't seeing all these FPs.

Thanks.

-- 
Stan


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Benny Pedersen <me...@junc.eu>.
Stan Hoeppner skrev den 2013-10-12 18:26:

> FSL_HELO_BARE_IP_2 needs to have a -much- lower score, or be eliminated
> entirely, as it overlaps with at least 3 other tests, as pointed out
> previously by another user.  If a message makes it through Gmane, and
> Debian, and then gets flagged by my "stock" rules introduced through an
> auto update, then something is obviously wrong.

The problem is that one hop in the delivery chain used an IP address in
its HELO; it does not matter that Debian did not. :)

Ask the Debian mailing list maintainers to file a bug for the missing SPF
and DKIM, so that recipients can whitelist something from this list?

Remember that SpamAssassin does not block, it just scores. If the scores
are wrong, adding more rules to help it score correctly will do.

> Content analysis details:   (4.8 points, 4.2 required)

>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>                             [score: 0.5314]

Yep, an overlap, but still under 5.0 in score?

For the overlap, filing a bug might help more.



Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/15/2013 4:15 PM, David B Funk wrote:
> On Mon, 14 Oct 2013, Stan Hoeppner wrote:
> 
>> On 10/14/2013 2:47 PM, Adam Katz wrote:
>>> On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
>>>> These two rules are adding 4.0 pts [...]
>>>> Content analysis details:   (4.8 points, 4.2 required)
>>>>  pts rule name              description
>>>> ---- ---------------------------------------------------------------------
>>>>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>>>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>>>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>>>                             [score: 0.5314]
>>>
>>> The others have addressed the "two rules" you mentioned, so I'll leave
>>> that alone in this email.
>>>
>>> There's more here than that:  If you're using Bayes, you have to train
>>> it.  Right now, it's hurting you:  Those 0.8 points should be some
>>> negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
>>> and BAYES_05), which would then have made that message score 2.1 or 3.5,
>>> both of which are below your 4.2 threshold (which is already too low!).
>>
>> There's no doubt my Bayes isn't working.  I ran a few hundred each of
>> ham and spam through sa-learn just after installing SA some year+ ago.
>> I haven't regularly fed it since, though I have run through maybe a few
>> dozen spam that weren't scored high enough.  And I think I may have
>> inadvertently run through one or two msgs that had anti-Bayesian text
>> blocks in them -- Bible verses, Wikipedia content, etc.
>>
>> I just ran 120 hams through, about half were msgs tagged previously with
>> Bayes_60 through Bayes_95.
>>
>> ~$ sa-learn --ham --mbox --progress /home/stan/mail/ham
>> Learned tokens from 0 message(s) (0 message(s) examined)
>>
>> Obviously there's a problem with no tokens learned.  A few questions:
>>
>> 1.  Is the database the problem?  If so...

Thanks for the reply David.

> When it says "(0 message(s) examined)" that shows that it was unable
> to parse -any- messages out of that input file. This tends to imply that
> the contents of that "/home/stan/mail/ham" file are not a "mbox" format
> or it's an empty mailbox.

No, actually the file doesn't exist.  We have a case sensitivity error.
 Why doesn't sa-learn return the filesystem error?  I'd have instantly
caught my typo if it did.

$ more /home/stan/mail/ham
/home/stan/mail/ham: No such file or directory

> First thing to fix, get your input recognised as messages. Then see how
> they're being learned.

Note the caps "H" in Ham.

~$ sa-learn --ham --mbox --progress /home/stan/mail/Ham
...
Learned tokens from 113 message(s) (116 message(s) examined)

113/116 is promising.  I'll keep feeding more ham through and we'll see
if these FPs start to fall.  Thanks again David for helping me catch a
simple typo.  Maybe we could someday get sa-learn to properly return
error msgs?
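
Until then, a small wrapper can surface the error before sa-learn ever
runs.  This is only a sketch: learn_ham and the SA_LEARN override are
hypothetical names, not part of SpamAssassin; the sa-learn options are
the ones used above.

```shell
#!/bin/sh
# Hypothetical helper, not part of SpamAssassin: check the mbox path
# before handing it to sa-learn, so a typo shows up as an error.
learn_ham() {
    mbox=$1
    if [ ! -f "$mbox" ]; then
        echo "learn_ham: $mbox: No such file" >&2
        return 1
    fi
    if [ ! -s "$mbox" ]; then
        echo "learn_ham: $mbox: mailbox is empty" >&2
        return 1
    fi
    # SA_LEARN is an assumed override for testing; defaults to sa-learn.
    "${SA_LEARN:-sa-learn}" --ham --mbox --progress "$mbox"
}
```

Called as "learn_ham /home/stan/mail/ham", this prints the error and
returns 1 instead of reporting zero messages examined.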

-- 
Stan


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Mon, 14 Oct 2013, Stan Hoeppner wrote:

> On 10/14/2013 2:47 PM, Adam Katz wrote:
>> On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
>>> These two rules are adding 4.0 pts [...]
>>> Content analysis details:   (4.8 points, 4.2 required)
>>>  pts rule name              description
>>> ---- ---------------------------------------------------------------------
>>>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>>                             [score: 0.5314]
>>
>> The others have addressed the "two rules" you mentioned, so I'll leave
>> that alone in this email.
>>
>> There's more here than that:  If you're using Bayes, you have to train
>> it.  Right now, it's hurting you:  Those 0.8 points should be some
>> negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
>> and BAYES_05), which would then have made that message score 2.1 or 3.5,
>> both of which are below your 4.2 threshold (which is already too low!).
>
> There's no doubt my Bayes isn't working.  I ran a few hundred each of
> ham and spam through sa-learn just after installing SA some year+ ago.
> I haven't regularly fed it since, though I have run through maybe a few
> dozen spam that weren't scored high enough.  And I think I may have
> inadvertently run through one or two msgs that had anti-Bayesian text
> blocks in them -- Bible verses, Wikipedia content, etc.
>
> I just ran 120 hams through, about half were msgs tagged previously with
> Bayes_60 through Bayes_95.
>
> ~$ sa-learn --ham --mbox --progress /home/stan/mail/ham
> Learned tokens from 0 message(s) (0 message(s) examined)
>
> Obviously there's a problem with no tokens learned.  A few questions:
>
> 1.  Is the database the problem?  If so...

When it says "(0 message(s) examined)" that shows that it was unable
to parse -any- messages out of that input file. This tends to imply that
the contents of that "/home/stan/mail/ham" file are not a "mbox" format
or it's an empty mailbox.

First thing to fix, get your input recognised as messages. Then see how
they're being learned.

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Stan Hoeppner <st...@hardwarefreak.com>.
On 10/14/2013 2:47 PM, Adam Katz wrote:
> On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
>> These two rules are adding 4.0 pts [...]
>> Content analysis details:   (4.8 points, 4.2 required)
>>  pts rule name              description
>> ---- ---------------------------------------------------------------------
>>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>>                             [score: 0.5314]
> 
> The others have addressed the "two rules" you mentioned, so I'll leave
> that alone in this email.
> 
> There's more here than that:  If you're using Bayes, you have to train
> it.  Right now, it's hurting you:  Those 0.8 points should be some
> negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
> and BAYES_05), which would then have made that message score 2.1 or 3.5,
> both of which are below your 4.2 threshold (which is already too low!).

There's no doubt my Bayes isn't working.  I ran a few hundred each of
ham and spam through sa-learn just after installing SA some year+ ago.
I haven't regularly fed it since, though I have run through maybe a few
dozen spam that weren't scored high enough.  And I think I may have
inadvertently run through one or two msgs that had anti-Bayesian text
blocks in them -- Bible verses, Wikipedia content, etc.

I just ran 120 hams through, about half were msgs tagged previously with
Bayes_60 through Bayes_95.

~$ sa-learn --ham --mbox --progress /home/stan/mail/ham
Learned tokens from 0 message(s) (0 message(s) examined)

Obviously there's a problem with no tokens learned.  A few questions:

1.  Is the database the problem?  If so...

2.  Is there a way to flush the Bayes database and restart training?

3.  Should I even be using Bayes on a mail stream that is over
    99.9% technical list mail, replete with lots of C code?  For
    some reason my Bayes likes to add +3-3.5 to most messages from
    the XFS list that contain code.

> On that threshold:  there are better ways to nail more spam than
> lowering the threshold.  SpamAssassin is highly tuned for 5.0 and while
> it's safe to bump that threshold up (more conservative, e.g. I block at
> 8.0 and flag at 5.0), it is not as safe to pull it down.

I'd guess your mail stream is quite different.  FYI, of all non-list
mail entering my MX, I block about 98% of spam at SMTP before it enters
the queue.  SA does pretty good at catching the last 2% and AFAIK it has
never tagged any non-list mail.  WRT the list mail it does just 'ok'
tagging spam.  But it FPs about twice as many ham.  I haven't kept track
of hard numbers.  It's just not worth the time frankly.

I installed SA with only one goal in mind:  stop the "last 2%" -- the
list-borne spam which I can't touch with SMTP restrictions, and the very
few non-list spam that make it into the queue.  This is a tall order,
which is why I had low hopes from the start.  I subbed the list not
because the spam catch rate was too low to tolerate, but because ham
tagging suddenly went through the roof due to one rule.

> Better way #1: plugins.  Razor2, Pyzor, DCC.  Decently drop-in (though
> DCC isn't as easy as it once was).
> 
> Better way #2: Bayes.  Set it up to facilitate better training.  Create
> "learn-spam" and "learn-nonspam" folders for each user and run cron jobs
> that run sa-learn (or better, spamassassin -r so you can learn and
> report them) and then empty the folders.  Once you can trust Bayes, you
> can increase the magnitude of its scores.  Do this slowly and carefully.

Given that the overall content of my list mail doesn't change much day
to day, or over time, I would think that manually training ham once with
a few hundred msgs would be sufficient.  Then train spam on occasion.
Which is what I've done.

My guess is that my use case is unique enough that I'll need to do a lot
of manual tuning, creating custom rules, etc, to get SA working somewhat
well, without the occasional FP spike that brought me here.  Which is
exactly what I do NOT want to do.  I've spent years tuning Postfix to
nail 98% of wire bound spam.  I'd rather not spend many more years
tweaking SA to catch the last 2% sneaking in through a few mailing lists...

> Better way #3: AWL.  This is now disabled by default, in part due to
> misunderstandings (it is horribly named; it's as much a black list as it
> is a white list, and it's not as "persistent" as its storage model
> purports). <snip>

Which is exactly why I didn't enable it.

>> Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
>> 	by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
>> 	for <st...@hardwarefreak.com>; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
>> [...]
>> X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
>> X-Spam-Level:
>> X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
>> 	LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
>> 	T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
>> [...]
>> X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-10000 required=5.3
>> 	tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
>> 	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
>> 	T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham
> 
> Another option is to trust Debian's SA instance.  You can add
> 82.195.75.100 to trusted_networks in your local.cf.  Be careful, this
> would mean inheriting some of Debian's false negatives.

That makes little sense, given my stated reasons for using SA.

-- 
Stan


Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by Adam Katz <an...@khopis.com>.
On 10/12/2013 09:26 AM, Stan Hoeppner wrote:
> These two rules are adding 4.0 pts [...]
> Content analysis details:   (4.8 points, 4.2 required)
>  pts rule name              description
> ---- ---------------------------------------------------------------------
>  2.8 FSL_HELO_BARE_IP_2     FSL_HELO_BARE_IP_2
>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
>  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
>                             [score: 0.5314]

The others have addressed the "two rules" you mentioned, so I'll leave
that alone in this email.

There's more here than that:  If you're using Bayes, you have to train
it.  Right now, it's hurting you:  Those 0.8 points should be some
negative value, perhaps -1.9 or -0.5 (the default scores for BAYES_00
and BAYES_05), which would then have made that message score 2.1 or 3.5,
both of which are below your 4.2 threshold (which is already too low!).
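
The arithmetic can be checked with a short shell sketch.  The rescore
function is a hypothetical helper; the 4.8 total and the 0.8, -1.9 and
-0.5 scores are the figures quoted above.

```shell
#!/bin/sh
# Hypothetical helper: replace the BAYES_50 contribution (0.8) in the
# 4.8-point example with another Bayes score and print the new total.
rescore() {
    awk -v total=4.8 -v bayes50=0.8 -v repl="$1" \
        'BEGIN { printf "%.1f\n", total - bayes50 + repl }'
}

rescore -1.9   # BAYES_00 figure from the text
rescore -0.5   # BAYES_05 figure from the text
```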

On that threshold:  there are better ways to nail more spam than
lowering the threshold.  SpamAssassin is highly tuned for 5.0 and while
it's safe to bump that threshold up (more conservative, e.g. I block at
8.0 and flag at 5.0), it is not as safe to pull it down.

Better way #1: plugins.  Razor2, Pyzor, DCC.  Decently drop-in (though
DCC isn't as easy as it once was).

Better way #2: Bayes.  Set it up to facilitate better training.  Create
"learn-spam" and "learn-nonspam" folders for each user and run cron jobs
that run sa-learn (or better, spamassassin -r so you can learn and
report them) and then empty the folders.  Once you can trust Bayes, you
can increase the magnitude of its scores.  Do this slowly and carefully.
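
A cron-driven training job along those lines might look like the sketch
below.  The folder paths, the train_folder and train_all names, and the
SA_LEARN override are assumptions for illustration; only sa-learn's
--spam, --ham and --mbox options are real.  It uses plain sa-learn
rather than spamassassin -r, so nothing is reported.

```shell
#!/bin/sh
# Hypothetical layout: <maildir>/learn-spam and <maildir>/learn-nonspam
# are mbox files that the user drops misclassified mail into.

# Learn one mbox as "spam" or "ham", then empty it if learning succeeded.
train_folder() {
    kind=$1
    mbox=$2
    [ -s "$mbox" ] || return 0          # missing or empty: nothing to do
    # SA_LEARN is an assumed override for testing; defaults to sa-learn.
    "${SA_LEARN:-sa-learn}" "--$kind" --mbox "$mbox" || return 1
    : > "$mbox"                         # truncate only after a clean run
}

# Train both folders for one user's mail directory.
train_all() {
    train_folder spam "$1/learn-spam" || return 1
    train_folder ham  "$1/learn-nonspam"
}
```

The folder is emptied only when sa-learn exits cleanly, so a failed run
leaves the messages in place for the next attempt.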

Better way #3: AWL.  This is now disabled by default, in part due to
misunderstandings (it is horribly named; it's as much a black list as it
is a white list, and it's not as "persistent" as its storage model
purports).  This nudges a sender's mail towards its previous average
score.  Set it up site-wide, /not/ per-user, and start it with a low
factor (say 0.1) until you can trust it, slowly increasing it up to 0.5
(you can go higher, but I wouldn't go too much higher; I use 0.333). 
Keep in mind that AWL doesn't clean up after itself the way Bayes does,
so the DB will grow over time.  There are limited guides online for how
to prune it.

> Received: from bendel.debian.org (bendel.debian.org [82.195.75.100])
> 	by greer.hardwarefreak.com (Postfix) with ESMTP id C95BD6C0CE
> 	for <st...@hardwarefreak.com>; Sat, 12 Oct 2013 10:23:37 -0500 (CDT)
> [...]
> X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on bendel.debian.org
> X-Spam-Level:
> X-Spam-Status: No, score=-9.6 required=4.0 tests=FOURLA,FREEMAIL_FROM,
> 	LDOSUBSCRIBER,LDO_WHITELIST,RCVD_NUMERIC_HELO,T_RP_MATCHES_RCVD,
> 	T_TO_NO_BRKTS_FREEMAIL autolearn=unavailable version=3.3.2
> [...]
> X-Amavis-Spam-Status: No, score=-5.735 tagged_above=-10000 required=5.3
> 	tests=[BAYES_00=-2, FOURLA=0.1, FREEMAIL_FROM=0.001, LDO_WHITELIST=-5,
> 	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_NUMERIC_HELO=1.164,
> 	T_RP_MATCHES_RCVD=-0.01, T_TO_NO_BRKTS_FREEMAIL=0.01] autolearn=ham

Another option is to trust Debian's SA instance.  You can add
82.195.75.100 to trusted_networks in your local.cf.  Be careful, this
would mean inheriting some of Debian's false negatives.

Re: FSL_HELO_BARE_IP_2 & RCVD_NUMERIC_HELO

Posted by John Hardin <jh...@impsec.org>.
On Sat, 12 Oct 2013, Stan Hoeppner wrote:

> Content analysis details:   (4.8 points, 4.2 required)

Why did you lower the required score?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   End users want eye candy and the "ooo's and aaaahhh's" experience
   when reading mail. To them email isn't a tool, but an entertainment
   form.                                                 -- Steve Lake
-----------------------------------------------------------------------
  499 days since the first successful private support mission to ISS (SpaceX)