Posted to users@spamassassin.apache.org by mw <mw...@stocznia.gdynia.pl> on 2005/03/10 11:33:08 UTC
I can't autolearn bayes databases with spam - continuation
Many thanks for all the previous replies on the list about the problems with
autolearn=spam.
I've taken your remarks into account and, first of all, I've trained my
Bayesian databases.
Here is the output of the sa-learn --dump magic command:
0.000 0 3 0 non-token data: bayes db version
0.000 0 272 0 non-token data: nspam
0.000 0 245 0 non-token data: nham
0.000 0 21292 0 non-token data: ntokens
0.000 0 1109767086 0 non-token data: oldest atime
0.000 0 1110286647 0 non-token data: newest atime
0.000 0 1110365778 0 non-token data: last journal sync atime
0.000 0 0 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
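For anyone reading along, the counts in this dump can be checked mechanically. The sketch below parses the "non-token data" lines and compares nspam and nham against 200 each, which is, as far as I know, the stock value of bayes_min_spam_num and bayes_min_ham_num; treat that as an assumption if your configuration overrides it. The function names are made up for illustration.

```python
import re

# Output of `sa-learn --dump magic` as quoted in the post (counts only).
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 272 0 non-token data: nspam
0.000 0 245 0 non-token data: nham
0.000 0 21292 0 non-token data: ntokens
"""

# Assumption: default minimums (bayes_min_spam_num / bayes_min_ham_num).
MIN_SPAM = 200
MIN_HAM = 200

LINE_RE = re.compile(r"\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s*(.+)")


def parse_magic(text):
    """Return {label: integer value} for each 'non-token data' line."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return counts


def bayes_has_enough_training(counts, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """True when both the spam and ham counts reach the minimums."""
    return counts.get("nspam", 0) >= min_spam and counts.get("nham", 0) >= min_ham


counts = parse_magic(DUMP)
print(counts["nspam"], counts["nham"], bayes_has_enough_training(counts))
```

With 272 spam and 245 ham messages learned, both counts are above that assumed minimum, which is consistent with BAYES_99 firing in the headers further down.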
And here is my local.cf (Bayesian filtering is left at its default values,
which is why it is on):
score MISSING_SUBJECT 15.0
score NIGERIAN_BODY1 15.0
bayes_file_mode 0770
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 1
ok_languages en pl
ok_locales en
As a reminder, I've prepared a script that generates my own spam and sends it to
my mail server.
The server sits on a local network, not on the Internet, because I'm only testing
SpamAssassin.
Here is the analysis of one example spam message:
Content analysis details: (41.3 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.3 FROM_NO_LOWER          From address has no lower-case characters
-2.8 ALL_TRUSTED            Did not pass through any untrusted hosts
 0.8 AMATEUR_PORN           BODY: Possible porn - Amateur Porn
 1.3 MILLION_USD            BODY: Talks about millions of dollars
 0.5 SUBJ_2_CREDIT          BODY: Contains 'subject to credit approval'
 0.8 DEAR_FRIEND            BODY: Dear Friend? That's not very dear!
 0.4 US_DOLLARS_3           BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
 0.5 BODY_ENHANCEMENT       BODY: Information on growing body parts
 0.6 PORN_URL_MISC          URI: URL uses words/phrases which indicate porn (misc)
 0.0 DRUGS_ERECTILE         Refers to an erectile drug
  15 MISSING_SUBJECT        Missing Subject: header
 0.5 UPPERCASE_75_100       message body is 75-100% uppercase
 0.5 NIGERIAN_BODY2         Message body looks like a Nigerian spam message 2+
  15 NIGERIAN_BODY1         Message body looks like a Nigerian spam message 1+
 1.4 INVALID_MSGID          Message-Id is not valid, according to RFC 2822
 1.4 NIGERIAN_BODY4         Message body looks like a Nigerian spam message 4+
 2.3 LONGWORDS              Long string of long words
 1.9 NIGERIAN_BODY3         Message body looks like a Nigerian spam message 3+
And here are the headers of the spam message mentioned above:
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on kronos
X-Spam-Level: ******************************************
X-Spam-Status: Yes, score=42.5 required=5.0 tests=ALL_TRUSTED,AMATEUR_PORN,
    BAYES_99,DEAR_FRIEND,DRUGS_ERECTILE,FROM_NO_LOWER,INVALID_MSGID,
    LONGWORDS,MILLION_USD,MISSING_SUBJECT,NIGERIAN_BODY1,NIGERIAN_BODY2,
    NIGERIAN_BODY3,NIGERIAN_BODY4,PORN_URL_MISC,SUBJ_2_CREDIT,
    UPPERCASE_50_75,US_DOLLARS_3 autolearn=no version=3.0.2
X-Spam-Report:
* 0.4 FROM_NO_LOWER From address has no lower-case characters
* -3.3 ALL_TRUSTED Did not pass through any untrusted hosts
* 1.7 AMATEUR_PORN BODY: Possible porn - Amateur Porn
* 2.8 MILLION_USD BODY: Talks about millions of dollars
* 0.1 SUBJ_2_CREDIT BODY: Contains 'subject to credit approval'
* 0.1 DEAR_FRIEND BODY: Dear Friend? That's not very dear!
* 0.4 US_DOLLARS_3 BODY: Mentions millions of $ ($NN,NNN,NNN.NN)
* 1.6 PORN_URL_MISC URI: URL uses words/phrases which indicate porn (misc)
* 1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* [score: 1.0000]
* 0.2 DRUGS_ERECTILE Refers to an erectile drug
* 15 MISSING_SUBJECT Missing Subject: header
* 0.0 UPPERCASE_50_75 message body is 50-75% uppercase
* 0.6 NIGERIAN_BODY2 Message body looks like a Nigerian spam message 2+
* 15 NIGERIAN_BODY1 Message body looks like a Nigerian spam message 1+
* 1.1 INVALID_MSGID Message-Id is not valid, according to RFC 2822
* 2.7 NIGERIAN_BODY4 Message body looks like a Nigerian spam message 4+
* 2.3 LONGWORDS Long string of long words
* 0.1 NIGERIAN_BODY3 Message body looks like a Nigerian spam message 3+
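The two listings above can be compared rule by rule rather than by eye. The sketch below hard-codes the scores exactly as they appear in the "Content analysis details" listing and in the X-Spam-Report header, then reports which rules appear in only one of them and which scores differ. It only describes the difference; it makes no claim about the cause. The helper name compare is made up for illustration.

```python
# Scores copied from the "Content analysis details" listing.
console = {
    "FROM_NO_LOWER": 1.3, "ALL_TRUSTED": -2.8, "AMATEUR_PORN": 0.8,
    "MILLION_USD": 1.3, "SUBJ_2_CREDIT": 0.5, "DEAR_FRIEND": 0.8,
    "US_DOLLARS_3": 0.4, "BODY_ENHANCEMENT": 0.5, "PORN_URL_MISC": 0.6,
    "DRUGS_ERECTILE": 0.0, "MISSING_SUBJECT": 15.0, "UPPERCASE_75_100": 0.5,
    "NIGERIAN_BODY2": 0.5, "NIGERIAN_BODY1": 15.0, "INVALID_MSGID": 1.4,
    "NIGERIAN_BODY4": 1.4, "LONGWORDS": 2.3, "NIGERIAN_BODY3": 1.9,
}

# Scores copied from the X-Spam-Report header.
header = {
    "FROM_NO_LOWER": 0.4, "ALL_TRUSTED": -3.3, "AMATEUR_PORN": 1.7,
    "MILLION_USD": 2.8, "SUBJ_2_CREDIT": 0.1, "DEAR_FRIEND": 0.1,
    "US_DOLLARS_3": 0.4, "PORN_URL_MISC": 1.6, "BAYES_99": 1.9,
    "DRUGS_ERECTILE": 0.2, "MISSING_SUBJECT": 15.0, "UPPERCASE_50_75": 0.0,
    "NIGERIAN_BODY2": 0.6, "NIGERIAN_BODY1": 15.0, "INVALID_MSGID": 1.1,
    "NIGERIAN_BODY4": 2.7, "LONGWORDS": 2.3, "NIGERIAN_BODY3": 0.1,
}


def compare(a, b):
    """Return (rules only in a, rules only in b, {rule: (a_score, b_score)})."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {
        name: (a[name], b[name])
        for name in sorted(set(a) & set(b))
        if a[name] != b[name]
    }
    return only_a, only_b, changed


only_console, only_header, changed = compare(console, header)
print("only in content analysis:", only_console)
print("only in header:", only_header)
for name, (c, h) in changed.items():
    print(f"{name}: {c} vs {h}")
```

Note that the scores set explicitly in local.cf (MISSING_SUBJECT and NIGERIAN_BODY1 at 15) are identical in both listings, while most of the stock scores differ.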
My questions are:
My main question:
1) I still can't get a mail whose header contains autolearn=spam.
It seems that this spam should be learned as spam because:
- it has more than 3 points from the header and more than 3 points from
the body
- the score is more than 12 points (bayes_auto_learn_threshold_spam
12.0)
However, if the score of the mail is less than 0.1, autolearning works
correctly (the header shows autolearn=ham).
I suspect that autolearning of spam doesn't work properly (?)
And the other ones:
2) There are differences between the scores of the tests in the Content analysis
details and in the header (see above).
For example, the FROM_NO_LOWER test has 1.3 pts in the Content analysis details
and 0.4 in the header; the BAYES_99 test doesn't appear in the Content analysis
details at all, but it does appear in the header.
Why?
3) I added the following lines to local.cf:
rewrite_subject 1
subject_tag *****SPAM*****
use_terse_report 0
auto_learn 1
Now, if I run spamassassin -D --lint, I get these messages:
config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
config: SpamAssassin failed to parse line, skipping: use_terse_report 0
config: SpamAssassin failed to parse line, skipping: auto_learn 1
config: SpamAssassin failed to parse line, skipping: rewrite_subject 1
config: SpamAssassin failed to parse line, skipping: subject_tag *****SPAM*****
config: SpamAssassin failed to parse line, skipping: use_terse_report 0
What does it mean?
Regards
Mirek Wasik
Re: I can't autolearn bayes databases with spam - continuation
Posted by Matt Kettler <mk...@comcast.net>.
At 05:33 AM 3/10/2005, mw wrote:
>3) I added the following lines to local.cf :
>
> rewrite_subject 1
> subject_tag *****SPAM*****
> use_terse_report 0
> auto_learn 1
>
> Now, if I run spamassassin -D --lint
> I find the statements :
>
> config: SpamAssassin failed to parse line, skipping: rewrite_subject
<snip>
> What does it mean ?
It means all of those options are invalid.
rewrite_subject and subject_tag have been replaced by rewrite_header as of 3.0.
auto_learn was replaced by bayes_auto_learn long ago, although the 2.6
series would still handle the typo for you.
use_terse_report has been deprecated for a long time, and 2.6 would
silently ignore the option as it was meaningless. The combination of
report_safe and the report template commands in recent versions of SA is
substantially more flexible anyway.
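The renames described in this reply can be checked against a local.cf before running --lint. The following is a minimal sketch, not part of SpamAssassin: the REPLACED table only restates what the reply says, and find_obsolete is a hypothetical helper name.

```python
# Option names and advice restate the reply above; nothing here is
# read from SpamAssassin itself.
REPLACED = {
    "rewrite_subject": "replaced by rewrite_header as of 3.0",
    "subject_tag": "replaced by rewrite_header as of 3.0",
    "auto_learn": "replaced by bayes_auto_learn",
    "use_terse_report": "deprecated; see report_safe and the report template commands",
}


def find_obsolete(cf_text):
    """Return (line number, option, advice) for each obsolete option found."""
    findings = []
    for lineno, line in enumerate(cf_text.splitlines(), start=1):
        stripped = line.split("#", 1)[0].strip()  # drop comments
        if not stripped:
            continue
        option = stripped.split()[0]
        if option in REPLACED:
            findings.append((lineno, option, REPLACED[option]))
    return findings


# The four lines from the original post, plus one option that is still valid.
SAMPLE = """\
rewrite_subject 1
subject_tag *****SPAM*****
use_terse_report 0
auto_learn 1
bayes_file_mode 0770
"""

for lineno, option, advice in find_obsolete(SAMPLE):
    print(f"line {lineno}: {option}: {advice}")
```

Only the first whitespace-separated word of each line is compared, so bayes_auto_learn is not mistaken for auto_learn.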
RE: I can't autolearn bayes databases with spam - continuation
Posted by Greg Allen <ga...@netrox.net>.
If you still have this, it is not going to work, as I said.
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 7.0
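To make the thresholds in this thread concrete, here is a rough sketch of the decision as the original post describes it: learn as ham below the nonspam threshold, learn as spam at or above the spam threshold provided both the header and the body contribute at least 3 points. This is an illustration only. SpamAssassin computes its own autolearn score, which may not equal the score shown in the headers, and the boundary handling (< versus <=) here is an assumption. The function name and the sample inputs are hypothetical.

```python
def autolearn_verdict(header_pts, body_pts, total,
                      ham_threshold=0.1, spam_threshold=12.0, min_part=3.0):
    """Return 'ham', 'spam' or 'no' under the simplified rules described above.

    ham_threshold / spam_threshold mirror bayes_auto_learn_threshold_nonspam
    and bayes_auto_learn_threshold_spam; min_part is the 3-point header/body
    requirement mentioned in the original post.
    """
    if total < ham_threshold:
        return "ham"
    if total >= spam_threshold and header_pts >= min_part and body_pts >= min_part:
        return "spam"
    return "no"


# Hypothetical inputs, not taken from the message in this thread.
print(autolearn_verdict(5.0, 6.0, 15.0))                      # both parts above 3
print(autolearn_verdict(1.0, 14.0, 15.0))                     # header part too small
print(autolearn_verdict(5.0, 5.0, 10.0, spam_threshold=7.0))  # threshold lowered to 7.0
```

The second call shows why a high total alone is not enough under these rules: if nearly all the points come from body tests, the header requirement is not met.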