Posted to users@spamassassin.apache.org by mw <mw...@stocznia.gdynia.pl> on 2005/03/10 11:33:08 UTC

I can't autolearn bayes databases with spam - continuation

Many thanks for all the previous replies on the list about my problems with
autolearn=spam.
I've taken your remarks into account and, first of all, I've trained my
Bayesian databases.
Here is the output of the sa-learn --dump magic command:

0.000          0          3          0  non-token data: bayes db version
0.000          0        272          0  non-token data: nspam
0.000          0        245          0  non-token data: nham
0.000          0      21292          0  non-token data: ntokens
0.000          0 1109767086          0  non-token data: oldest atime
0.000          0 1110286647          0  non-token data: newest atime
0.000          0 1110365778          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
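
As a quick sanity check on a dump like the one above, a short Python sketch can pull nspam and nham out of the output and compare them with the 200-message minimum that SpamAssassin's Bayes needs by default (bayes_min_spam_num and bayes_min_ham_num). The function names and the embedded sample are illustrative only and are not part of SpamAssassin.

```python
import re

# Sample copied from the dump above (only the lines needed here).
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        272          0  non-token data: nspam
0.000          0        245          0  non-token data: nham
0.000          0      21292          0  non-token data: ntokens
"""

# In the "non-token data" lines the value of interest is the third column.
LINE_RE = re.compile(r"^\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$")


def parse_magic(text):
    """Return {label: integer value} for every 'non-token data' line."""
    values = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            values[m.group(2)] = int(m.group(1))
    return values


def bayes_has_enough_training(values, minimum=200):
    """True when both nspam and nham reach the assumed default minimum."""
    return values.get("nspam", 0) >= minimum and values.get("nham", 0) >= minimum


magic = parse_magic(DUMP)
print(magic["nspam"], magic["nham"], bayes_has_enough_training(magic))
```

With 272 spam and 245 ham messages learned, both counts are above 200, which is consistent with BAYES_99 firing in the header shown later.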

And here is my local.cf (Bayesian filtering is left at its default values,
which is why it is enabled):

score MISSING_SUBJECT 15.0
score NIGERIAN_BODY1  15.0
bayes_file_mode         0770
skip_rbl_checks         0
use_razor2              1
use_dcc                 1
use_pyzor               1
ok_languages            en pl
ok_locales              en

As a reminder, I've prepared a script which generates my own spam messages
and sends them to my mail server.
This server is on a local network, not on the Internet, because I'm only
testing SpamAssassin.

Here are the content analysis details for my sample spam:

Content analysis details:   (41.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.3 FROM_NO_LOWER          From address has no lower-case characters
-2.8 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.8 AMATEUR_PORN           BODY: Possible porn - Amateur Porn
 1.3 MILLION_USD            BODY: Talks about millions of dollars
 0.5 SUBJ_2_CREDIT          BODY: Contains 'subject to credit approval'
 0.8 DEAR_FRIEND            BODY: Dear Friend? That's not very dear!
 0.4 US_DOLLARS_3           BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
 0.5 BODY_ENHANCEMENT       BODY: Information on growing body parts
 0.6 PORN_URL_MISC          URI: URL uses words/phrases which indicate porn (misc)
 0.0 DRUGS_ERECTILE         Refers to an erectile drug
  15 MISSING_SUBJECT        Missing Subject: header
 0.5 UPPERCASE_75_100       message body is 75-100% uppercase
 0.5 NIGERIAN_BODY2         Message body looks like a Nigerian spam message 2+
  15 NIGERIAN_BODY1         Message body looks like a Nigerian spam message 1+
 1.4 INVALID_MSGID          Message-Id is not valid, according to RFC 2822
 1.4 NIGERIAN_BODY4         Message body looks like a Nigerian spam message 4+
 2.3 LONGWORDS              Long string of long words
 1.9 NIGERIAN_BODY3         Message body looks like a Nigerian spam message 3+

And here are the headers of the above-mentioned spam:

X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on kronos
X-Spam-Level: ******************************************
X-Spam-Status: Yes, score=42.5 required=5.0 tests=ALL_TRUSTED,AMATEUR_PORN,
 BAYES_99,DEAR_FRIEND,DRUGS_ERECTILE,FROM_NO_LOWER,INVALID_MSGID,
 LONGWORDS,MILLION_USD,MISSING_SUBJECT,NIGERIAN_BODY1,NIGERIAN_BODY2,
 NIGERIAN_BODY3,NIGERIAN_BODY4,PORN_URL_MISC,SUBJ_2_CREDIT,
 UPPERCASE_50_75,US_DOLLARS_3 autolearn=no version=3.0.2
X-Spam-Report:
 *  0.4 FROM_NO_LOWER From address has no lower-case characters
 * -3.3 ALL_TRUSTED Did not pass through any untrusted hosts
 *  1.7 AMATEUR_PORN BODY: Possible porn - Amateur Porn
 *  2.8 MILLION_USD BODY: Talks about millions of dollars
 *  0.1 SUBJ_2_CREDIT BODY: Contains 'subject to credit approval'
 *  0.1 DEAR_FRIEND BODY: Dear Friend? That's not very dear!
 *  0.4 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
 *  1.6 PORN_URL_MISC URI: URL uses words/phrases which indicate porn (misc)
 *  1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *      [score: 1.0000]
 *  0.2 DRUGS_ERECTILE Refers to an erectile drug
 *   15 MISSING_SUBJECT Missing Subject: header
 *  0.0 UPPERCASE_50_75 message body is 50-75% uppercase
 *  0.6 NIGERIAN_BODY2 Message body looks like a Nigerian spam message 2+
 *   15 NIGERIAN_BODY1 Message body looks like a Nigerian spam message 1+
 *  1.1 INVALID_MSGID Message-Id is not valid, according to RFC 2822
 *  2.7 NIGERIAN_BODY4 Message body looks like a Nigerian spam message 4+
 *  2.3 LONGWORDS Long string of long words
 *  0.1 NIGERIAN_BODY3 Message body looks like a Nigerian spam message 3+
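
To make the two reports above easier to compare, here is a small Python sketch that parses the score and rule name from each and lists what differs. The scores are copied from the reports; the regular expression and the helper name parse_report are illustrative assumptions, not anything SpamAssassin provides. If the set of rules that fired differs, and not just the scores, that would suggest the two reports come from separate scans.

```python
import re

# Scores copied from the "Content analysis details" report above.
DETAILS = """\
 1.3 FROM_NO_LOWER
-2.8 ALL_TRUSTED
 0.8 AMATEUR_PORN
 1.3 MILLION_USD
 0.5 SUBJ_2_CREDIT
 0.8 DEAR_FRIEND
 0.4 US_DOLLARS_3
 0.5 BODY_ENHANCEMENT
 0.6 PORN_URL_MISC
 0.0 DRUGS_ERECTILE
  15 MISSING_SUBJECT
 0.5 UPPERCASE_75_100
 0.5 NIGERIAN_BODY2
  15 NIGERIAN_BODY1
 1.4 INVALID_MSGID
 1.4 NIGERIAN_BODY4
 2.3 LONGWORDS
 1.9 NIGERIAN_BODY3
"""

# Scores copied from the X-Spam-Report header above.
HEADER = """\
 *  0.4 FROM_NO_LOWER
 * -3.3 ALL_TRUSTED
 *  1.7 AMATEUR_PORN
 *  2.8 MILLION_USD
 *  0.1 SUBJ_2_CREDIT
 *  0.1 DEAR_FRIEND
 *  0.4 US_DOLLARS_3
 *  1.6 PORN_URL_MISC
 *  1.9 BAYES_99
 *  0.2 DRUGS_ERECTILE
 *   15 MISSING_SUBJECT
 *  0.0 UPPERCASE_50_75
 *  0.6 NIGERIAN_BODY2
 *   15 NIGERIAN_BODY1
 *  1.1 INVALID_MSGID
 *  2.7 NIGERIAN_BODY4
 *  2.3 LONGWORDS
 *  0.1 NIGERIAN_BODY3
"""

# Optional leading "*", then a signed score, then an upper-case rule name.
RULE_RE = re.compile(r"^\s*\*?\s*(-?\d+(?:\.\d+)?)\s+([A-Z][A-Z0-9_]+)")


def parse_report(text):
    """Return {rule name: score} for every line that looks like a rule hit."""
    scores = {}
    for line in text.splitlines():
        m = RULE_RE.match(line)
        if m:
            scores[m.group(2)] = float(m.group(1))
    return scores


details = parse_report(DETAILS)
header = parse_report(HEADER)

only_details = sorted(set(details) - set(header))
only_header = sorted(set(header) - set(details))
changed = {
    rule: (details[rule], header[rule])
    for rule in details.keys() & header.keys()
    if details[rule] != header[rule]
}

print("only in details:", only_details)
print("only in header: ", only_header)
for rule in sorted(changed):
    print(f"{rule}: {changed[rule][0]} -> {changed[rule][1]}")
```

With the values above, BODY_ENHANCEMENT and UPPERCASE_75_100 appear only in the content analysis details, while BAYES_99 and UPPERCASE_50_75 appear only in the header.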

My questions are:

My main question:
1) I still can't get a mail whose header contains autolearn=spam.
  It seems that this spam should be learned as spam because:
  - it has more than 3 points from the header and more than 3 points from
    the body
  - the score is more than 12 points (bayes_auto_learn_threshold_spam 12.0)
  However, if the score of a mail is less than 0.1, autolearning works
  correctly (I can see autolearn=ham in the header).

 I suppose autolearning of spam doesn't work properly (?)
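
The two conditions listed in question 1 can be written down as a minimal Python sketch. It models only those conditions (a total at or above the spam threshold, plus at least 3 header points and 3 body points); the function name, the treatment of the boundary values, and the example numbers are assumptions. The score SpamAssassin actually uses for this decision may differ from the one shown in X-Spam-Status (for instance, the Bayes rules themselves may not be counted), and how it splits rules into header and body is not modelled here.

```python
def would_autolearn_as_spam(total, header_points, body_points,
                            spam_threshold=12.0,
                            min_header=3.0, min_body=3.0):
    """Check the conditions described in the post for learning as spam.

    total          -- score used for the autolearn decision (assumed input)
    header_points  -- points contributed by header rules (assumed input)
    body_points    -- points contributed by body rules (assumed input)
    """
    if total < spam_threshold:
        return False
    # A high total is not enough: both parts must contribute on their own.
    return header_points >= min_header and body_points >= min_body


# Hypothetical numbers, not taken from the message above.
print(would_autolearn_as_spam(total=20.0, header_points=5.0, body_points=15.0))
print(would_autolearn_as_spam(total=20.0, header_points=2.0, body_points=18.0))
print(would_autolearn_as_spam(total=8.0, header_points=4.0, body_points=4.0))
```

The second call shows why a message can score well above the threshold and still not be learned: the header side alone falls short of 3 points.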

And the other ones:

2) There are differences between the scores of tests in the Content analysis
   details and in the header (see above).
   For example, the FROM_NO_LOWER test has 1.3 pts in the Content analysis
   details and 0.4 in the header; the BAYES_99 test does not appear in the
   Content analysis details at all, but it does appear in the header.
   Why?

3) I added the following lines to local.cf :

   rewrite_subject         1
   subject_tag             *****SPAM*****
   use_terse_report        0
   auto_learn              1

   Now, if I run spamassassin -D --lint
   I get these messages:

   config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
   config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
   config: SpamAssassin failed to parse line, skipping: use_terse_report 0
   config: SpamAssassin failed to parse line, skipping: auto_learn 1
   config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
   config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
   config: SpamAssassin failed to parse line, skipping: use_terse_report 0

   What does it mean?

Regards

Mirek Wasik





Re: I can't autolearn bayes databases with spam - continuation

Posted by Matt Kettler <mk...@comcast.net>.
At 05:33 AM 3/10/2005, mw wrote:
>3) I added the following lines to local.cf :
>
>    rewrite_subject         1
>    subject_tag             *****SPAM*****
>    use_terse_report        0
>    auto_learn              1
>
>    Now, if I run spamassassin -D --lint
>    I find the statements :
>
>    config: SpamAssassin failed to parse line, skipping: rewrite_subject

<snip>

>    What does it mean ?

It means all of those options are invalid.

rewrite_subject and subject_tag have been replaced by rewrite_header as of 3.0.

auto_learn was replaced by bayes_auto_learn long ago, although the 2.6
series would still handle the typo for you.

use_terse_report has been deprecated for a long time, and 2.6 would
silently ignore the option as it was meaningless. The combination of
report_safe and the report template commands in recent versions of SA is
substantially more flexible anyway.
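
The replacements described above can be sketched as a small Python helper that rewrites the old lines of a local.cf. This is only an illustration: the function name migrate_local_cf is hypothetical, the mapping follows this reply, and the output form "rewrite_header Subject TAG" is the 3.0 syntax as I understand it, so check the result with spamassassin --lint before relying on it.

```python
def migrate_local_cf(lines):
    """Return a new list of config lines with pre-3.0 options rewritten.

    rewrite_subject + subject_tag -> rewrite_header Subject <tag>
    auto_learn                    -> bayes_auto_learn
    use_terse_report              -> dropped (deprecated, no replacement)
    """
    rewrite_on = False
    tag = None
    out = []
    for line in lines:
        parts = line.split(None, 1)
        key = parts[0] if parts else ""
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "rewrite_subject":
            rewrite_on = value == "1"
        elif key == "subject_tag":
            tag = value
        elif key == "use_terse_report":
            continue
        elif key == "auto_learn":
            out.append(f"bayes_auto_learn {value}")
        else:
            # Anything else (including blank lines) is passed through as-is.
            out.append(line)
    # Only emit a subject rewrite if it was switched on and a tag was given.
    if rewrite_on and tag:
        out.append(f"rewrite_header Subject {tag}")
    return out


OLD = [
    "rewrite_subject         1",
    "subject_tag             *****SPAM*****",
    "use_terse_report        0",
    "auto_learn              1",
    "use_razor2              1",
]

for new_line in migrate_local_cf(OLD):
    print(new_line)
```

For the four lines from the original post this leaves bayes_auto_learn 1 and rewrite_header Subject *****SPAM*****, and passes unrelated options such as use_razor2 through unchanged.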




RE: I can't autolearn bayes databases with spam - continuation

Posted by Greg Allen <ga...@netrox.net>.
If you still have this, it is not going to work, as I said.


bayes_auto_learn_threshold_nonspam      0.1
bayes_auto_learn_threshold_spam         7.0


