Posted to users@spamassassin.apache.org by Rich Shepard <rs...@appl-ecosys.com> on 2009/06/01 18:28:27 UTC

Identifying Source of False Positives

   I'm running SA-3.2.5 on Slackware-12.2 and encountering false positives on
messages that SA has not previously flagged as spam. Specifically, the
daily postfix mail log summary report and the daily logwatch report are
marked as spam; they are sent by root to me as a user. Because
/etc/procmailrc threw these messages away, it took a long time to figure out
that SA mislabeling them was the immediate problem.

   Over the past few months I've also had problems with messages from three
specific domains that were never delivered to my inbox. However, when a
procmail recipe directed all messages to me at my business domain to a
different mail file, they were delivered.

   How can I determine what causes SA to mark the log summary reports as
spam? This is the first issue I want to resolve. I saw nothing appropriate
on the web site's FAQ or front page so if I missed the information please
point me to the appropriate location.

Rich

Re: Identifying Source of False Positives

Posted by John Hardin <jh...@impsec.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:

> On Mon, 1 Jun 2009, John Hardin wrote:
>
>>  Have you kept your spam and ham corpora?
>
>  I'm not sure. The spam comes from the spam-uncaught file which is 
> cleared each time it's run.

Pity. If you're training manually, it's a very good idea to retain your 
corpora so you can review the training and retrain from scratch if needed.

>>  Okay, let's key on that one.
>> 
>> > ## Call SpamAssassin
>> > :0fw: spamassassin.lock
>> > * < 256000
>> > | spamassassin
>> 
>> :0 fw: spamassassin.lock
>> * < 256000
>> * ! ^TO_abuse@
>> * ! ^List-Id: .*<?users[@.]spamassassin\.apache\.org>?
>> * ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
>> | /usr/bin/spamc
>>
>>  Using spamc creates less load than launching spamassassin from scratch
>>  for every email, but you do have to manage the daemon (i.e. restart it
>>  if the rules change).
>
>  I run spamd:
>
>  2978 ?        Ss    12:16 /usr/bin/spamd -d --pidfile=/var/run/spamd.pid
>  3052 ?        S      0:04 spamd child
>  3054 ?        S      0:05 spamd child
>
> is this not adequate for a light load?

That's fine. If you're currently running spamd, then having procmail call 
spamassassin is wasteful. That recompiles all of the rules from scratch 
for every message you receive, where using spamc/spamd compiles the rules 
once when you restart the daemon.

>>  Are your resources really so limited that you want to serialize all
>>  email delivery? As a middle ground you might consider per-user
>>  lockfiles instead, e.g.:
>
>> :0 fw: $HOME/.spamassassin.lock
>>
>>  I'd also suggest upping the size limit a bit, but that's not a big issue.
>>
>>  There are more complex things you can do; you might want to take a
>>  look at http://www.impsec.org/~jhardin/antispam/spamassassin.procmail
>
>  There are only two users on this network and a low mail volume for each 
> of us.

Ok, then your locking should work okay.

> I'll keep teaching SA that the log reports are ham and see if that makes 
> a difference.

It will help, though it may take a while to override their current 
learning as spam.

> As I wrote earlier, this is all within the past quarter year,
> and it's been a PITA since it's taken time and attention away from my
> business.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...to announce there must be no criticism of the President or to
   stand by the President right or wrong is not only unpatriotic and
   servile, but is morally treasonous to the American public.
                                           -- Theodore Roosevelt, 1918
-----------------------------------------------------------------------
  5 days until the 65th anniversary of D-Day

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, John Hardin wrote:

> And I assume you look at the spam-uncaught file before learning it?

   Yes. The messages in there are those I deliberately move there after
they've ended up in my inbox because neither the postfix filters nor the
spamassassin rules caught them.

> If some log files got in there and were learned, that could explain the 
> deterioration.

   That seems very reasonable, but I would have had to move them there myself
and I cannot recall doing so. Also, before running sa-learn to classify them
as spam I look over the list. So, it's quite possible that they ended up
classified as spam unintentionally.

> Have you kept your spam and ham corpora?

   I'm not sure. The spam comes from the spam-uncaught file which is cleared
each time it's run. The ham comes from various mail lists and they grow over
time.

> Okay, let's key on that one.
>
>> ## Call SpamAssassin
>> :0fw: spamassassin.lock
>> * < 256000
>> | spamassassin
>
> :0 fw: spamassassin.lock
> * < 256000
> * ! ^TO_abuse@
> * ! ^List-Id: .*<?users[@.]spamassassin\.apache\.org>?
> * ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
> | /usr/bin/spamc
>
> Using spamc creates less load than launching spamassassin from scratch for 
> every email, but you do have to manage the daemon (i.e. restart it if the 
> rules change).

   I run spamd:

  2978 ?        Ss    12:16 /usr/bin/spamd -d --pidfile=/var/run/spamd.pid
  3052 ?        S      0:04 spamd child
  3054 ?        S      0:05 spamd child

is this not adequate for a light load?

> Are your resources really so limited that you want to serialize all email 
> delivery? As a middle ground you might consider per-user lockfiles instead, 
> e.g.:

>   :0 fw: $HOME/.spamassassin.lock
>
> I'd also suggest upping the size limit a bit, but that's not a big issue.
>
> There are more complex things you can do; you might want to take a look at 
> http://www.impsec.org/~jhardin/antispam/spamassassin.procmail

   There are only two users on this network and a low mail volume for each of
us.

   The size limit has been at that value for years without a problem. I'll
keep teaching SA that the log reports are ham and see if that makes a
difference. As I wrote earlier, this is all within the past quarter year,
and it's been a PITA since it's taken time and attention away from my
business.

Thanks,

Rich

Re: Identifying Source of False Positives

Posted by John Hardin <jh...@impsec.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:

> On Mon, 1 Jun 2009, John Hardin wrote:
>
>>  If these are system-generated messages, something is improperly training
>>  SA that they are spam. Do you use autolearn?
>
> John,
>
>  No. Once a week or so I run sa-learn specifying spam on the spam-uncaught
> mbox file. Less frequently I run it on mail list files specifying them as
> ham.

And I assume you look at the spam-uncaught file before learning it?

If some log files got in there and were learned, that could explain the 
deterioration.

Have you kept your spam and ham corpora? I would suggest wiping your Bayes 
database and retraining it, after reviewing the corpora.
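
A minimal sketch of the wipe-and-retrain step suggested above. It assumes the
reviewed spam and ham are kept as two mbox files (the file names below are
placeholders) and that it is run as the same user whose mail SA scores. The
SA_LEARN variable and the retrain_bayes name are hypothetical conveniences,
added only so the sequence can be previewed with echo before touching the
real database.

```shell
# Command to run; set SA_LEARN=echo to preview without touching Bayes.
SA_LEARN="${SA_LEARN:-sa-learn}"

# retrain_bayes SPAM_MBOX HAM_MBOX
# Clears the Bayes database, then relearns from the two reviewed mbox files.
# Each step runs only if the previous one succeeded.
retrain_bayes() {
    "$SA_LEARN" --clear &&
    "$SA_LEARN" --spam --mbox "$1" &&
    "$SA_LEARN" --ham --mbox "$2"
}

# Optional: keep a copy of the current database first.
#   sa-learn --backup > bayes-backup.txt
# Example call (placeholder file names):
#   retrain_bayes ~/mail/spam-corpus ~/mail/ham-corpus
```

Running it once with SA_LEARN=echo prints the three sa-learn invocations in
order, which is a cheap way to check the file names before the real run.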

>>  Primarily I'd suggest you exclude locally-generated emails from SA
>>  completely. If you'd post the Received: headers from such a message and
>>  the procmail stanza where you pass messages to SA for scoring I could
>>  suggest something.
>
>  Here are all headers from the mail log summary:
>
> From root@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
> Return-Path: <ro...@salmo.appl-ecosys.com>
> X-Spam-Flag: YES
> X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
> 	salmo.appl-ecosys.com
> X-Spam-Level: ****
> X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
> 	 EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
> 	 autolearn=no version=3.2.5-ph20040310.0
> X-Spam-Report:
> 	 * -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
> 	 *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
> 	 *      [score: 1.0000]
> 	 *  2.5 EMPTY_BODY BODY: Message has subject but no body
> 	 *  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in
> 	 URL
> 	 *  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
> 	 *  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
> 	 *  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
> 	 * -1.8 AWL AWL: From: address is in the auto white-list
> X-Original-To: rshepard@appl-ecosys.com
> Delivered-To: rshepard@appl-ecosys.com
> Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
> 	 by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
> 	 for <rs...@appl-ecosys.com>; Mon,  1 Jun 2009 11:25:44 -0700
> 	 (PDT)

Okay, let's key on that one.

> ## Call SpamAssassin
> :0fw: spamassassin.lock
> * < 256000
> | spamassassin

:0 fw: spamassassin.lock
* < 256000
* ! ^TO_abuse@
* ! ^List-Id: .*<?users[@.]spamassassin\.apache\.org>?
* ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
| /usr/bin/spamc

Using spamc creates less load than launching spamassassin from scratch for 
every email, but you do have to manage the daemon (i.e. restart it if the 
rules change).

Are your resources really so limited that you want to serialize all email 
delivery? As a middle ground you might consider per-user lockfiles 
instead, e.g.:

    :0 fw: $HOME/.spamassassin.lock

I'd also suggest upping the size limit a bit, but that's not a big issue.

There are more complex things you can do; you might want to take a look at 
http://www.impsec.org/~jhardin/antispam/spamassassin.procmail


Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Bowie Bailey wrote:

> The empty body problem is a more difficult problem. Have procmail save a
> copy of the raw message somewhere and take a look at it. Make sure there
> is a blank line between the headers and the body.

Bowie, et al.:

   Progress is being made. I discovered that the local.cf was for sa-1.3 or
so, and there was a local.cf.new in the same directory. I saved the old
version and made the .new one the working copy. Many fewer rules.

   On a real spam that was saved for my examination I see that the EMPTY_BODY
check was not triggered. I'll watch this a couple of days and see if that
continues to hold true.

   In the meantime, I'm retraining SA on the false positives to teach it that
they are ham rather than spam. When my log summary reports start appearing
in my INBOX and the other false positives from the mail lists (such as this
one) stop appearing in the spam hold mailbox, I'll relax.

   Thank you all for the very helpful suggestions. I'll update the status
over the next days.

Rich

-- 
Richard B. Shepard, Ph.D.               |  Integrity            Credibility
Applied Ecosystem Services, Inc.        |            Innovation
<http://www.appl-ecosys.com>     Voice: 503-667-4517      Fax: 503-667-8863

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Theo Van Dinter wrote:

> My guess is you did something like "spamassassin -D filename", where
> filename gets treated as the argument to -D, so then it was waiting for input.

Theo,

   Yes, this is what I did.

> If this is the case, try "spamassassin -D < filename > /dev/null". :)

   Interesting:

[785] dbg: rules: running uri tests; score so far=1.2
[785] dbg: rules: compiled uri tests
[785] dbg: rules: ran uri rule NORMAL_HTTP_TO_IP ======> got hit:
"http://211.129.107.12"
[785] dbg: rules: ran uri rule URI_HEX ======> got hit:
"http://kemp-5d866973"
[785] dbg: rules: ran uri rule NUMERIC_HTTP_ADDR ======> got hit:
"http://1898218"
[785] dbg: rules: ran uri rule URI_NOVOWEL ======> got hit: "http://jcwpjkp"
[785] dbg: rules: ran uri rule __DOS_HAS_ANY_URI ======> got hit: "h"
[785] dbg: eval: stock info total: 0
[785] warn: rules: failed to run CG_FUJI_JPG test, skipping:
[785] warn:  (Can't locate object method "image_name_regex" via package
"Mail::SpamAssassin::PerMsgStatus" at (eval 719) line 1315.
[785] warn: )
[785] warn: rules: failed to run CG_DOUBLEDOT_GIF test, skipping:
[785] warn:  (Can't locate object method "image_name_regex" via package
"Mail::SpamAssassin::PerMsgStatus" at (eval 719) line 1580.
[785] warn: )
[785] warn: rules: failed to run CG_SONY_JPG test, skipping:
[785] warn:  (Can't locate object method "image_name_regex" via package
"Mail::SpamAssassin::PerMsgStatus" at (eval 719) line 2601.
[785] warn: )
[785] dbg: rules: ran eval rule BAYES_50 ======> got hit (1)
[785] warn: rules: failed to run CG_CANON_JPG test, skipping:
[785] warn:  (Can't locate object method "image_name_regex" via package
"Mail::SpamAssassin::PerMsgStatus" at (eval 719) line 4000.
[785] warn: )
[785] dbg: rules: running rawbody tests; score so far=3.191
[785] dbg: rules: compiled rawbody tests
[785] dbg: rules: running full tests; score so far=3.191
[785] dbg: rules: compiled full tests
[785] dbg: util: current PATH is:
/root/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/usr/lib/java/bin:/usr/lib/java/jre/bin:/usr/lib/java/bin:/usr/lib/java/jre/bin:/usr/lib/qt/bin:/usr/share/texmf/bin
[785] dbg: pyzor: pyzor is not available: no pyzor executable found
[785] dbg: pyzor: no pyzor found, disabling Pyzor
[785] dbg: rules: running meta tests; score so far=3.191
[785] dbg: rules: compiled meta tests
[785] dbg: check: running tests for priority: 500
[785] dbg: dns: harvest_dnsbl_queries
[785] dbg: async: select found 4 responses ready (t.o.=0.0)
[785] dbg: async: completed in 0.149 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:10.96.127.75
[785] dbg: async: completed in 0.156 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:10.178.19.65
[785] dbg: async: completed in 0.155 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:11.25.147.192
[785] dbg: async: completed in 0.155 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:110.0.55.209
[785] dbg: async: queries completed: 4, started: 0
[785] dbg: async: queries active: URI-DNSBL=62 URI-NS=10 at Mon Jun 1
15:53:13 2009
[785] dbg: dns: harvest_dnsbl_queries - check_tick
[785] dbg: async: select found 1 responses ready (t.o.=1.0)
[785] dbg: async: completed in 0.158 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:39.0.58.80
[785] dbg: async: queries completed: 1, started: 0
[785] dbg: async: queries active: URI-DNSBL=61 URI-NS=10 at Mon Jun 1
15:53:13 2009
[785] dbg: dns: harvest_dnsbl_queries - check_tick
   ...
[785] dbg: check: is spam? score=3.191 required=4
[785] dbg: check:
tests=ALL_TRUSTED,BAYES_50,EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
[785] dbg: check:
subtests=__DATE_700,__DOS_BODY_MON,__DOS_HAS_ANY_URI,__DOS_RCVD_MON,__DOS_REF_TODAY,__ENV_AND_HDR_FROM_MATCH,__FB_NUM_PERCNT,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__KAM_MED2,__KAM_NUMBER2,__KAM_TIME4,__MISSING_REF,__MSGID_OK_DIGITS,__MSGID_OK_HOST,__MSOE_MID_WRONG_CASE,__NAKED_TO,__NONEMPTY_BODY,__SANE_MSGID,__TOCC_EXISTS,__hk_obfdomreq2

   It suddenly jumps from 1.2 to 3.191 after looking for images. I don't know
where to fix that. I think that I need to update SPF, too, because that's
compiled against an earlier perl version.

Rich

Re: Identifying Source of False Positives

Posted by Theo Van Dinter <fe...@apache.org>.
fwiw, even if there isn't a blank line, SA will figure it out (though
it'll trigger a MISSING_HB_SEP rule hit).

As for the debug output ... it depends, how did you run the command
(ie: what was the command you tried).  My guess is you did something
like "spamassassin -D filename", where filename gets treated as the
argument to -D, so then it was waiting for input.  If this is the
case, try "spamassassin -D < filename > /dev/null". :)

On Mon, Jun 1, 2009 at 6:09 PM, Rich Shepard <rs...@appl-ecosys.com> wrote:
>  There is always a blank line between headers and body. I tried running
> 'spamassassin -D' on the saved message and nothing happened. Should it take
> more than a few seconds to complete and return a debug report?
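
The redirect Theo describes can be taken one step further to list only the
rules that fired. This is a sketch, not a supported tool: it assumes the
message was saved to a file (the name below is a placeholder) and that the
debug lines keep the "ran ... rule NAME ======> got hit" shape seen in the
debug output posted in this thread. The helper name sa_rule_hits is made up
for illustration.

```shell
# sa_rule_hits: read SpamAssassin debug output on stdin and print the name
# of every rule that reported a hit, one per line, in the order they ran.
sa_rule_hits() {
    # Matching lines look like:
    #   [785] dbg: rules: ran uri rule URI_HEX ======> got hit:
    sed -n 's/.*dbg: rules: ran [a-z_]* rule \([A-Za-z0-9_]*\) ======> got hit.*/\1/p'
}

# Debug output goes to stderr, so send stderr into the pipe and discard the
# rewritten message on stdout (placeholder file name):
#   spamassassin -D < saved-message.txt 2>&1 >/dev/null | sa_rule_hits
```

Sub-rules whose names start with a double underscore are listed too; they
carry no score of their own, so they can usually be skipped when hunting for
the rule that pushed a message over the threshold.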

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Bowie Bailey wrote:

> Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the Bayes
> problem, sa-learn some of these messages as ham.  Make sure you are
> learning as the right user...

Bowie,

   I just did this on a run from this morning. I'll do so again tomorrow
morning with both the mail log and log watch reports.

> The empty body problem is a more difficult problem.  Have procmail save a
> copy of the raw message somewhere and take a look at it.  Make sure there
> is a blank line between the headers and the body.  Run 'spamassassin -D'
> on this saved message and look for anything unusual in the debug output.

   There is always a blank line between headers and body. I tried running
'spamassassin -D' on the saved message and nothing happened. Should it take
more than a few seconds to complete and return a debug report?

Thanks,

Rich

Re: [SA] Identifying Source of False Positives -- RESOLVED

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Fri, 5 Jun 2009, Adam Katz wrote:

> Since that regex matches nothing, I assume you meant it to be
> m'^[^\n]+\n\s*$'s  or  m'^[^\n]+\n\s*$'ms

Adam,

   I didn't write this. It apparently came with the local.cf file a few years
ago.

Rich


Re: [SA] Identifying Source of False Positives -- RESOLVED

Posted by Adam Katz <an...@khopis.com>.
Rich Shepard wrote:
> # for empty message bodies:
> body       EMPTY_BODY   m'^[^\n]+\n\s*$'
> describe   EMPTY_BODY   Message has subject but no body
> score      EMPTY_BODY   2.5

Egads ... that's an unbounded multi-line regex (that little plus sign is
quite CPU-intensive).  I don't understand its intent, either ... it
looks for a line that includes linebreaks but with no multi-line flag.
Ignoring that bug, it wants a nonzero line followed by either a blank
line or a line filled only with spaces.  How does this characterize an
empty body?  What does this have to do with the presence of a subject?

Since that regex matches nothing, I assume you meant it to be
m'^[^\n]+\n\s*$'s  or  m'^[^\n]+\n\s*$'ms

With a trailing s, that rule matches one-line emails that end in a blank
line (which are quite common).

With a trailing ms, that rule matches any email with a paragraph in it
(like this one), which is almost every single email.

It appears you wanted something like this:

body     __EMPTY_BODY  !~ m'\w\n\w's
meta     SUBJ_NO_BODY  __EMPTY_BODY && __HAS_SUBJECT
describe SUBJ_NO_BODY  Message has subject but no body
score    SUBJ_NO_BODY  2.5

Or perhaps like this:

body     EMPTY_BODY    !~ m'\w\n\w's
describe EMPTY_BODY    Message has no text in body
score    EMPTY_BODY    2.5

Also, that score seems pretty high, and I wonder about your intent.  If
you're trying to use it to catch image-only spam, please use the other
rules we've proposed on the list, like MIME_IMAGE_ONLY.

Re: Identifying Source of False Positives -- RESOLVED

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Fri, 5 Jun 2009, Bowie Bailey wrote:

> In that case, you should be able to track down the issue by comparing the
> two files. Is the EMPTY_BODY rule defined in the old local.cf file? If
> so, what does it say?

Bowie,

   Yes, it was in the old local.cf:

# for empty message bodies:
body       EMPTY_BODY   m'^[^\n]+\n\s*$'
describe   EMPTY_BODY   Message has subject but no body
score      EMPTY_BODY   2.5

   It apparently used to work, but doesn't with the new SA to which I upgraded
a few months ago.

Thanks,

Rich


Re: Identifying Source of False Positives -- RESOLVED

Posted by Bowie Bailey <Bo...@BUC.com>.
Rich Shepard wrote:
>>> The empty body problem is a more difficult problem.  Have procmail save a
>>> copy of the raw message somewhere and take a look at it.  Make sure there
>>> is a blank line between the headers and the body.  Run 'spamassassin -D'
>>> on this saved message and look for anything unusual in the debug output.
>
>   This seems to have been resolved by replacing the old
> /etc/mail/spamassassin/local.cf with the new version. Many fewer rules and
> other entries, but I no longer see the EMPTY_BODY test adding 2.5 to the
> scores.

In that case, you should be able to track down the issue by comparing 
the two files.  Is the EMPTY_BODY rule defined in the old local.cf 
file?  If so, what does it say?
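
One way to answer that question without reading both files end to end is to
search them for the rule name. The sketch below assumes the old and new
configuration files sit side by side; the helper name find_rule and the file
names in the example are placeholders, not anything SpamAssassin provides.

```shell
# find_rule RULE_NAME FILE [FILE ...]
# Print, with line numbers, every config line of the form
#   <keyword> RULE_NAME ...
# (body, describe, score and so on). Comment lines and rules whose names
# merely start with RULE_NAME are not matched.
find_rule() {
    rule="$1"
    shift
    grep -nE "^[[:space:]]*[a-z_]+[[:space:]]+${rule}([[:space:]]|\$)" "$@"
}

# Example (placeholder file names):
#   find_rule EMPTY_BODY local.cf.old local.cf
```

If the rule shows up only in the old file, the definition there is the one
that was adding the score.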

-- 
Bowie

Re: Identifying Source of False Positives -- RESOLVED

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Tue, 2 Jun 2009, Rich Shepard wrote:

>  I started doing this today. Each of the false positive messages was
> exported from alpine to a file, and I ran sa-learn on that file telling it
> the text is ham.

   Today the mail and logwatch summary reports appeared in my inbox and there
were no false positives in the holding cell. This may have resolved the
issue of missing messages, but I'll continue to monitor and train SA on the
ham that was mistakenly labeled as spam.

>> The empty body problem is a more difficult problem.  Have procmail save a
>> copy of the raw message somewhere and take a look at it.  Make sure there
>> is a blank line between the headers and the body.  Run 'spamassassin -D'
>> on this saved message and look for anything unusual in the debug output.

   This seems to have been resolved by replacing the old
/etc/mail/spamassassin/local.cf with the new version. Many fewer rules and
other entries, but I no longer see the EMPTY_BODY test adding 2.5 to the
scores.

Thank you all very much,

Rich


Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Bowie Bailey wrote:

> Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the Bayes
> problem, sa-learn some of these messages as ham.  Make sure you are
> learning as the right user...

Bowie,

   I started doing this today. Each of the false positive messages was
exported from alpine to a file, and I ran sa-learn on that file telling it
the text is ham.

> The empty body problem is a more difficult problem.  Have procmail save a
> copy of the raw message somewhere and take a look at it.  Make sure there
> is a blank line between the headers and the body.  Run 'spamassassin -D'
> on this saved message and look for anything unusual in the debug output.

   Part of the problem is that I cannot tell what's unusual in the debug
output. When I tried this yesterday (properly), I saw where the score
suddenly jumped from 1.2 to 5.21 with no visible (to me) explanation.

Rich

Re: Identifying Source of False Positives

Posted by Bowie Bailey <Bo...@BUC.com>.
Rich Shepard wrote:
>   Here are all headers from the mail log summary:
>
> From root@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
> Return-Path: <ro...@salmo.appl-ecosys.com>
> X-Spam-Flag: YES
> X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
>     salmo.appl-ecosys.com
> X-Spam-Level: ****
> X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
>     EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
>     autolearn=no version=3.2.5-ph20040310.0
> X-Spam-Report:
>     * -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
>     *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
>     *      [score: 1.0000]
>     *  2.5 EMPTY_BODY BODY: Message has subject but no body
>     *  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
>     *  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
>     *  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
>     *  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
>     * -1.8 AWL AWL: From: address is in the auto white-list
> X-Original-To: rshepard@appl-ecosys.com
> Delivered-To: rshepard@appl-ecosys.com
> Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
>     by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
>     for <rs...@appl-ecosys.com>; Mon,  1 Jun 2009 11:25:44 -0700 (PDT)
> Received: (from root@localhost)
>     by salmo.appl-ecosys.com (8.14.3/8.14.2/Submit) id n51IPibx030133;
>     Mon, 1 Jun 2009 11:25:44 -0700
> Date: Mon, 1 Jun 2009 11:25:44 -0700
> From: root@salmo.appl-ecosys.com
> Message-Id: <20...@salmo.appl-ecosys.com>
> To: rshepard@appl-ecosys.com
> Subject: *****SPAM***** salmo Daily Mail Report for Monday, 01 June 2009
> X-Spam-Prev-Subject: salmo Daily Mail Report for Monday, 01 June 2009
>
> Report based on information in /var/log/maillog

Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the 
Bayes problem, sa-learn some of these messages as ham.  Make sure you 
are learning as the right user...

The empty body problem is a more difficult problem.  Have procmail save 
a copy of the raw message somewhere and take a look at it.  Make sure 
there is a blank line between the headers and the body.  Run 
'spamassassin -D' on this saved message and look for anything unusual in 
the debug output.
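
To see at a glance which rules dominate a score like the one quoted above,
the X-Spam-Report lines can be reduced to "score rule" pairs. This is only a
sketch; it assumes the report keeps the "* score RULE description" layout
shown in the quoted headers, and the helper name report_scores is made up for
illustration.

```shell
# report_scores: read an X-Spam-Report header (quoted or not) on stdin and
# print one "score RULE" pair per scoring line.
report_scores() {
    LC_ALL=C awk '
        {
            # A scoring line has a "*" field, then a number, then the rule.
            # Continuation lines such as "[score: 1.0000]" are skipped.
            for (i = 1; i < NF; i++) {
                if ($i == "*" && $(i + 1) ~ /^-?[0-9]+\.[0-9]+$/) {
                    print $(i + 1), $(i + 2)
                    break
                }
            }
        }
    '
}

# Largest contributors first (placeholder file name):
#   report_scores < headers.txt | sort -rn
# Total of all listed scores:
#   report_scores < headers.txt | awk '{ t += $1 } END { printf "%.1f\n", t }'
```

Applied to the report quoted above, the sorted list starts with BAYES_99 at
3.5 and EMPTY_BODY at 2.5, and the listed scores add up to the 4.9 shown in
X-Spam-Status.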

-- 
Bowie

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, John Hardin wrote:

> If these are system-generated messages, something is improperly training
> SA that they are spam. Do you use autolearn?

John,

   No. Once a week or so I run sa-learn specifying spam on the spam-uncaught
mbox file. Less frequently I run it on mail list files specifying them as
ham.

> Primarily I'd suggest you exclude locally-generated emails from SA
> completely. If you'd post the Received: headers from such a message and
> the procmail stanza where you pass messages to SA for scoring I could
> suggest something.

   Here are all headers from the mail log summary:

From root@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
Return-Path: <ro...@salmo.appl-ecosys.com>
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
 	salmo.appl-ecosys.com
X-Spam-Level: ****
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
 	EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
 	autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
 	* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
 	*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 	*      [score: 1.0000]
 	*  2.5 EMPTY_BODY BODY: Message has subject but no body
 	*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
 	*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
 	*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
 	*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
 	* -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshepard@appl-ecosys.com
Delivered-To: rshepard@appl-ecosys.com
Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
 	by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
 	for <rs...@appl-ecosys.com>; Mon,  1 Jun 2009 11:25:44 -0700 (PDT)
Received: (from root@localhost)
 	by salmo.appl-ecosys.com (8.14.3/8.14.2/Submit) id n51IPibx030133;
 	Mon, 1 Jun 2009 11:25:44 -0700
Date: Mon, 1 Jun 2009 11:25:44 -0700
From: root@salmo.appl-ecosys.com
Message-Id: <20...@salmo.appl-ecosys.com>
To: rshepard@appl-ecosys.com
Subject: *****SPAM***** salmo Daily Mail Report for Monday, 01 June 2009
X-Spam-Prev-Subject: salmo Daily Mail Report for Monday, 01 June 2009

Report based on information in /var/log/maillog

   And this is from ~/procmail/recipes.rc:

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

Thanks,

Rich

Re: Identifying Source of False Positives

Posted by John Hardin <jh...@impsec.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:

>  Here are the headers:
>
> From root@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
> Return-Path: <ro...@salmo.appl-ecosys.com>
> X-Spam-Flag: YES
> X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
> 	salmo.appl-ecosys.com
> X-Spam-Level: ****
> X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
> 	 EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
> 	 autolearn=no version=3.2.5-ph20040310.0
> X-Spam-Report:
> 	 * -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
> 	 *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
> 	 *      [score: 1.0000]

> I also don't know where the 3.5 on the second test arises.
>

If these are system-generated messages, something is improperly training 
SA that they are spam. Do you use autolearn?

>  Suggestions on how to proceed greatly appreciated.

Primarily I'd suggest you exclude locally-generated emails from SA 
completely. If you'd post the Received: headers from such a message and 
the procmail stanza where you pass messages to SA for scoring I could 
suggest something.


Re: [sa] Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Charles Gregory wrote:


> Is there anywhere in the procmail recipe *above* this one that some
> special condition has been specified as:
>
>   :0fwh
>
> ...which has the effect of 'filtering' the message down to just its
> headers? It wouldn't necessarily have to be a recent change to your
> procmailrc, it might just be a subtle change in the log mail that
> 'triggers' the rule when it didn't before.

Charles,

# BEGIN RECIPES

# Nuke duplicate messages
#:0 Wh: msgid.lock
#| $FORMAIL -D 8192 msgid.cache

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

   The first recipe has been commented out for a while now, so the call to SA
is at the top of the list.

> Next guess: Has this log summary grown in size past some limit that would 
> cause the whole body to be 'truncated'?

   No. The log summary report (with headers) is < 26,000 bytes.

Rich

Re: [sa] Re: Identifying Source of False Positives

Posted by Charles Gregory <cg...@hwcn.org>.
>>  First guess, look at the procmail code that 'chooses' to run spamassassin.
>>  Have you used an 'h' where you meant to use an 'H', thereby feeding *only*
>>  the header to spamassassin?
> ## Call SpamAssassin
> :0fw: spamassassin.lock
> * < 256000
> | spamassassin

Is there anywhere in the procmail recipe *above* this one that some 
special condition has been specified as:

    :0fwh

...which has the effect of 'filtering' the message down to just its
headers? It wouldn't necessarily have to be a recent change to your
procmailrc, it might just be a subtle change in the log mail that
'triggers' the rule when it didn't before.

Next guess: Has this log summary grown in size past some limit that would 
cause the whole body to be 'truncated'?

- Charles

Re: [sa] Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Charles Gregory wrote:

> Just to be clear, are you looking at the body in the actual rejected
> message,

Charles,

   Yes. The body consists of the mail log summary.

> First guess, look at the procmail code that 'chooses' to run spamassassin.
> Have you used an 'h' where you meant to use an 'H', thereby feeding *only*
> the header to spamassassin?

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

   This is how it's been for years.

Rich

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Tue, 2 Jun 2009, Charles Gregory wrote:

> This *really* suggests that one of two things MUST be occurring:
> 1) What you are seeing is NOT what spamassassin "sees".

Charles,

   Quite possible.

> 2) A character (null/ascii-zeros?) has been injected into the e-mail
>   somewhere in the headers, causing Spamassassin to cease its scan at that
>   point...

   Hmm-m-m-m. I cannot perceive a scenario where this is selective. For
example, the log reports sent by local root to me on the local machine, some
messages posted to this mail list (but not others in the same thread), some
messages posted to other mail lists (again, not all in the same thread), and
so on. There is no consistent pattern other than the locally generated log
summary reports.

> Presuming upon the latter, try examining all the headers injected by other
> processes like clamav. Particularly where *some* messages receive this
> treatment, but not *all*, you should be able to find a 'header difference'
> between the passed and failed messages.

   No clamav or similar. We run only linux with incoming mail processed by
postfix and procmail.

> Something to try:
> Setup a custom rule in local.cf to match a custom header
>   X-Spam-Test: YES
> And then , just before you scan the e-mail with spamassasin, use 'formail' to 
> add that header to the mail.

   I've not before used formail. SA is called from within
~/procmail/recipes.rc:

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

   Where do I insert a call to formail and what is the appropriate format?

Thanks,

Rich

-- 
Richard B. Shepard, Ph.D.               |  Integrity            Credibility
Applied Ecosystem Services, Inc.        |            Innovation
<http://www.appl-ecosys.com>     Voice: 503-667-4517      Fax: 503-667-8863

Re: Identifying Source of False Positives

Posted by Charles Gregory <cg...@hwcn.org>.
On Tue, 2 Jun 2009, Rich Shepard wrote:
>  This morning not only were the mail log report and logwatch report falsely
> flagged as spam, but so were several messages posted to the google group
> mail list for an application I use. What is interesting to me is that every
> one had a +2.5 score for EMPTY_BODY, while none of them had empty bodies.

This *really* suggests that one of two things MUST be occurring:

1) What you are seeing is NOT what spamassassin "sees".

2) A character (null/ascii-zeros?) has been injected into the e-mail
    somewhere in the headers, causing Spamassassin to cease its scan at
    that point...

Presuming upon the latter, try examining all the headers injected by other 
processes like clamav. Particularly where *some* messages receive this 
treatment, but not *all*, you should be able to find a 'header difference' 
between the passed and failed messages.

Something to try:
Set up a custom rule in local.cf to match a custom header
    X-Spam-Test: YES
And then, just before you scan the e-mail with spamassassin, use 'formail' 
to add that header to the mail. It will get injected at the end of the 
headers. If the test rule 'hits' then you have a real mystery. If the test 
rule does *not* 'hit', then we have evidence that something is causing 
Spamassassin to behave as if an End-Of-File condition has been reached on 
the mail before it read it all..... Null/zeros or something....
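
One quick way to test the NUL theory on a saved copy of a flagged message is to count the NUL bytes in it directly. This is only a sketch: count_nul_bytes is a made-up helper name, and it assumes the message has been saved to a file exactly as it was delivered.

```shell
# count_nul_bytes is a hypothetical helper, not part of SpamAssassin or procmail.
# It prints how many NUL (0x00) bytes the file named in $1 contains.
count_nul_bytes() {
    # tr -cd '\000' deletes every byte that is NOT a NUL, so only NULs reach wc.
    # The final tr strips the padding some wc implementations add.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage (the path is a placeholder for wherever the flagged message was saved):
#   count_nul_bytes /path/to/flagged-message
```

A result of 0 would rule out embedded NULs in that particular copy; anything else points at whatever process wrote the message.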

- Charles

Re: [sa] Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Charles Gregory wrote:

> Just to be clear, are you looking at the body in the actual rejected
> message, to make sure it is still there (not 'stripped' from the message)?

Charles,

   I hope the following information is helpful in telling you more
experienced folks why I'm having these false positives.

   This morning not only were the mail log report and logwatch report falsely
flagged as spam, but so were several messages posted to the google group
mail list for an application I use. What is interesting to me is that every
one had a +2.5 score for EMPTY_BODY, while none of them had empty bodies.

   Those from the google group mail list might have had an HTML-formatted
part, but the log reports are plain ASCII text.

   What might possibly be triggering this false rule?

Rich

Re: [sa] Re: Identifying Source of False Positives

Posted by Charles Gregory <cg...@hwcn.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:
> 	 *  2.5 EMPTY_BODY BODY: Message has subject but no body
>  There is certainly body content in the message; it's not empty so I don't
> understand the 2.5 on that third test. I also don't know where the 3.5 on
> the second test arises.

Just to be clear, are you looking at the body in the actual rejected 
message, to make sure it is still there (not 'stripped' from the message)?
First guess, look at the procmail code that 'chooses' to run spamassassin.
Have you used an 'h' where you meant to use an 'H', thereby feeding *only* 
the header to spamassassin?
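
To check whether the body really survives in the stored message, it may help to measure what follows the first blank line, since that line marks the end of the headers. A minimal sketch, assuming the message is saved in a file on its own (not inside a multi-message mbox); body_bytes is a hypothetical helper name:

```shell
# body_bytes is a hypothetical helper, not an existing tool.
# It prints the number of bytes after the first empty line of the file in $1,
# i.e. the size of the message body.
body_bytes() {
    # sed deletes line 1 through the first empty line (the headers);
    # if there is no empty line at all, everything is deleted and the
    # result is 0.
    sed '1,/^$/d' "$1" | wc -c | tr -d ' '
}

# Usage (placeholder path):
#   body_bytes /path/to/flagged-message
```

Note that only a truly empty line counts as the separator here. A result of 0 on a message that visibly has text in it would suggest the header/body boundary is not where a parser expects it.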

- C

Re: Identifying Source of False Positives

Posted by Rich Shepard <rs...@appl-ecosys.com>.
On Mon, 1 Jun 2009, Charles Gregory wrote:

> Well, firstly, examine the mail's full headers. There should be an
> X-Spam-Status header listing the tests that matched on the e-mail.

Charles/Dan/John:

   I certainly managed to forget this. I just ran /etc/cron.daily/1pflogsumm
and looked at the report.

   Here are the headers:

>From root@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
Return-Path: <ro...@salmo.appl-ecosys.com>
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
 	salmo.appl-ecosys.com
X-Spam-Level: ****
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
 	EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
 	autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
 	* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
 	*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 	*      [score: 1.0000]
 	*  2.5 EMPTY_BODY BODY: Message has subject but no body
 	*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
 	*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
 	*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
 	*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
 	* -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshepard@appl-ecosys.com

   I can send the entire report if that's necessary.

   There is certainly body content in the message; it's not empty so I don't
understand the 2.5 on that third test. I also don't know where the 3.5 on
the second test arises.
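
   As a sanity check, the score in X-Spam-Status should simply be the sum of the per-rule scores in X-Spam-Report, so the report shows which rules push the message over the 4.0 threshold. A small shell sketch that adds them up (sum_report_scores is a made-up helper name, and it assumes the report lines are pasted without any leading '>' quote markers):

```shell
# sum_report_scores is a hypothetical helper, not part of SpamAssassin.
# It reads X-Spam-Report lines on stdin and prints the total of the scores.
sum_report_scores() {
    # Rule lines look like "* -1.3 RULE_NAME description". Continuation
    # lines such as "*      [score: 1.0000]" have a non-numeric second
    # field and are skipped.
    awk '$1 == "*" && $2 ~ /^-?[0-9]+(\.[0-9]+)?$/ { total += $2 }
         END { printf "%.1f\n", total }'
}

# The report lines from the headers above:
sum_report_scores <<'EOF'
* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*      [score: 1.0000]
*  2.5 EMPTY_BODY BODY: Message has subject but no body
*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
* -1.8 AWL AWL: From: address is in the auto white-list
EOF
```

   This prints 4.9, matching score=4.9 in the header. Without either the 3.5 from BAYES_99 or the 2.5 from EMPTY_BODY, the total would fall below the required 4.0.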

   For about a decade these log summary reports showed up every day with no
problems. Earlier this spring they became sporadic, then ceased appearing at
all. This correlates with a distribution and SpamAssassin upgrade, so it
must be something different in SA that's triggering this response now.

   Suggestions on how to proceed greatly appreciated.

Thanks,

Rich

Re: Identifying Source of False Positives

Posted by Charles Gregory <cg...@hwcn.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:
> messages that have not before been seen as spam by SA. Specifically, the
> daily postfix mail log summary report and the daily logwatch report are
> marked as spam;

Well, firstly, examine the mail's full headers. There should be an
X-Spam-Status header listing the tests that matched on the e-mail.
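
The X-Spam-Status header is usually folded across several lines, which makes it awkward to pick out with a plain grep. A rough sketch for listing the rule names from a saved message (list_spam_tests is a hypothetical helper name; it assumes a single message on stdin with its headers intact):

```shell
# list_spam_tests is a hypothetical helper, not part of SpamAssassin.
# It prints one rule name per line from the X-Spam-Status header.
list_spam_tests() {
    awk '
        /^$/ { exit }                                  # blank line: end of headers
        /^X-Spam-Status:/ { in_hdr = 1; buf = $0; next }
        in_hdr && /^[ \t]/ { buf = buf $0; next }      # folded continuation line
        in_hdr { in_hdr = 0 }                          # next header: stop collecting
        END {
            gsub(/,[ \t]+/, ",", buf)                  # rejoin a list folded after a comma
            if (match(buf, /tests=[^ \t]+/)) {
                n = split(substr(buf, RSTART + 6, RLENGTH - 6), names, ",")
                for (i = 1; i <= n; i++) print names[i]
            }
        }'
}

# Usage (placeholder path):
#   list_spam_tests < /path/to/flagged-message
```

Only the header block is read, so text in the body that happens to contain "tests=" is ignored.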

At a first guess, I would suspect that your log includes a reference to
a blacklisted URI or e-mail. Given the nature of logs to contain 
information of this sort, I would strongly urge you to 'whitelist' the 
logs. For that matter, if this is internally generated mail, why are you 
running spamassassin at all? Or is this mail being passed via an outside 
(untrusted) network to your mailbox?

- C

Re: Identifying Source of False Positives

Posted by John Hardin <jh...@impsec.org>.
On Mon, 1 Jun 2009, Rich Shepard wrote:

>  I'm running SA-3.2.5 on Slackware-12.2 and encountering false positives on
> messages that have not before been seen as spam by SA. Specifically, the
> daily postfix mail log summary report and the daily logwatch report are
> marked as spam; they are sent by root to me as a user.

That sort of thing shouldn't even be hitting SA. If you're using procmail 
to glue in SA, you might want to add some exclusionary clauses to the 
stanza that calls SA.

>  Over the past few months I've also had problems with messages from three
> specific domains that were never delivered to my inbox. However, when a
> procmail recipe directed all messages to me at my business domain to a
> different mail file, they were delivered.

It can be a bad idea, particularly if you're an administrator or delegate 
for the postmaster@ or abuse@ aliases, to discard mail that SA has marked 
as spam. Quarantine it and periodically review the quarantine.

> How can I determine what causes SA to mark the log summary reports as
> spam? This is the first issue I want to resolve.

First, capture the messages rather than discarding them. The FPs should 
have the list of rules that hit in the headers.

For historical messages you should be able to look in your mail log 
(typically /var/log/maillog or rotated to /var/log/maillog.1.gz etc.) for 
the SA log entry for the messages in question, which also lists the rules 
hit.

If you post the list of rules hit, or better a complete FP message with 
all headers intact, we may be able to suggest more precisely. Please don't 
post messages to the list; post them on pastebin or a webserver you 
control, and send the URL to the list.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   It is not the business of government to make men virtuous or
   religious, or to preserve the fool from the consequences of his own
   folly.                                              -- Henry George
-----------------------------------------------------------------------
  5 days until the 65th anniversary of D-Day

Re: Identifying Source of False Positives

Posted by "McDonald, Dan" <Da...@austinenergy.com>.
On Mon, 2009-06-01 at 09:28 -0700, Rich Shepard wrote:
> I'm running SA-3.2.5 on Slackware-12.2 and encountering false positives on
> messages that have not before been seen as spam by SA. Specifically, the
> daily postfix mail log summary report and the daily logwatch report are
> marked as spam; they are sent by root to me as a user. Because
> /etc/procmailrc threw these messages away it took a long time to figure out
> that it was SA mis-labeling these messages that was the immediate problem.
> 
>    Over the past few months I've also had problems with messages from three
> specific domains that were never delivered to my inbox. However, when a
> procmail recipe directed all messages to me at my business domain to a
> different mail file, they were delivered.
> 
>    How can I determine what causes SA to mark the log summary reports as
> spam? 

Run the message through spamassassin -D and see what tests fire.

Most likely it will be that some of the domains that are reported in
your summary are listed in URIBL, SURBL, or some other uri block list.
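
To see which hostnames in a report could be matched by URI rules or looked up in a block list, it can help to list them first. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); list_uri_hosts is a made-up helper name and it only looks at http and https URLs:

```shell
# list_uri_hosts is a hypothetical helper, not part of SpamAssassin.
# It prints the unique host part of every http/https URL in the file in $1.
list_uri_hosts() {
    # grep -o prints only the matched text: the scheme plus everything up
    # to the next slash, whitespace, angle bracket or double quote.
    grep -oE 'https?://[^/[:space:]<>"]+' "$1" |
        sed 's#^[a-z]*://##' |   # drop the scheme, leaving the host
        sort -u
}

# Usage (placeholder path):
#   list_uri_hosts /path/to/flagged-message
```

Each host in the output can then be checked by hand against whichever lists are in use.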


-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX
www.austinenergy.com