Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2005/01/12 16:03:08 UTC

Re[2]: bayes?!

Hello kalin,

Tuesday, January 11, 2005, 6:33:53 PM, you wrote:

km> also....  does the amount of processed messages matter after the initial
km> feed of more or less the same amounts of spam and ham?
km> because the spam piles much faster than miscategorized ham...

Yes, spam piles much faster than ham (correctly or wrongly
categorized).

To have Bayes work best, feed it all the (manually verified) spam and
ham you can. Correcting SA's mistakes is most critical, but feed it
correctly identified ham and spam also if you can.

And don't worry about the ratio -- I feed Bayes spam/ham in a 10:1
ratio, and it's working wonderfully.
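
A minimal sketch of how that training feed could be scripted. The folder names are hypothetical, the --dbpath value is the one used elsewhere in this thread, and the helper only builds the sa-learn command line rather than running it:

```python
import shlex


def sa_learn_command(kind, path, dbpath=None):
    """Build an sa-learn argv list for a file or directory of verified mail."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    argv = ["sa-learn"]
    if dbpath is not None:
        argv += ["--dbpath", dbpath]
    argv += ["--" + kind, path]
    return argv


# Hypothetical folder names: substitute wherever manually verified mail is kept.
FOLDERS = (("spam", "Maildir/.Spam/cur"), ("ham", "Maildir/.Ham/cur"))

for kind, folder in FOLDERS:
    # Prints e.g.: sa-learn --dbpath /var/spamdb/bayes --spam Maildir/.Spam/cur
    print(shlex.join(sa_learn_command(kind, folder, dbpath="/var/spamdb/bayes")))
```

The printed commands can be reviewed and then run by hand or from cron; nothing here enforces a particular spam/ham ratio.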

Bob Menschel




RE: SA List Subject/From Indicators

Posted by Rob McEwen <ro...@powerviewsystems.com>.
Thanks Bob & Joseph. Good suggestions!

Rob McEwen



Re: SA List Subject/From Indicators

Posted by Bob McClure Jr <ro...@earthlink.net>.
On Thu, Jan 13, 2005 at 04:17:08AM -0500, Rob McEwen wrote:
> RE: SA List Messages Subject/From Indicators
> 
> Often, when I receive messages from the SA list, the FROM displays the name
> of the sender rather than the name of the list. Furthermore, the SUBJECT
> line is often an obvious SA-related phrase... BUT NOT ALWAYS. I find it
> annoying sometimes when this happens... that is, a FROM that is not SA, and
> a SUBJECT line that is also not easily associated with SA. Why? Because the
> subject line can sometimes be something that indicates an emergency and I
> then HAVE to read it ASAP to ensure that one of my own clients isn't having
> an emergency regarding my own services.
> 
> One solution would be to unsubscribe and re-subscribe under a different
> specially created e-mail address... but I do like keeping these SA messages
> "close to home" so I prefer to not go that route.
> 
> Another possible solution would be to have the list server add "SA: " to the
> beginning of each subject line (when not already there).
> 
> Any thoughts? Suggestions?
> 
> Rob McEwen

A useful line in the header of every SA list message is

List-Id: <users.spamassassin.apache.org>

Why not make Outlook filter on that, and put those in a separate box,
say, SA-List?
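
For anyone filtering outside a mail client, the same test can be sketched in a few lines of Python. This is only an illustration of matching on the List-Id header quoted above; the sample message and its sender address are made up:

```python
from email import message_from_string

# List-Id value quoted in this message.
SA_LIST_ID = "users.spamassassin.apache.org"


def is_sa_list_message(raw_message):
    """Return True when the message carries the SA users list's List-Id."""
    msg = message_from_string(raw_message)
    return SA_LIST_ID in msg.get("List-Id", "")


# Made-up message: alarming subject, personal From, but the List-Id gives it away.
sample = (
    "From: Someone <someone@example.com>\n"
    "Subject: URGENT: server down?\n"
    "List-Id: <users.spamassassin.apache.org>\n"
    "\n"
    "body\n"
)

print(is_sa_list_message(sample))  # True
```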

Cheers,
-- 
Bob McClure, Jr.             Bobcat Open Systems, Inc.
robertmcclure@earthlink.net  http://www.bobcatos.com
Wise men still seek Him.

Re: bayes?!

Posted by kalin mintchev <ka...@el.net>.
would it help if i built new dbs?
and use those to check if the debug will see the toks?
would that affect the sa learning process somehow?


>
>> sa-learn --dbpath /var/spamdb/bayes --dump magic
>
> i get this:
>
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0       2852          0  non-token data: nspam
> 0.000          0       2515          0  non-token data: nham
> 0.000          0     116330          0  non-token data: ntokens
> 0.000          0 1104894403          0  non-token data: oldest atime
> 0.000          0 1105570140          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0 1105571295          0  non-token data: last expiry atime
> 0.000          0     581418          0  non-token data: last expire atime delta
> 0.000          0      46098          0  non-token data: last expire reduction count
>
>> what are the file sizes?  are the files writable/readable by the
>> appropriate users?
>
> -rw-r-----   1 root  vchkpw   688128 Jan 12 18:08 bayes_seen
> -rw-r-----   1 root  vchkpw  2146304 Jan 12 18:08 bayes_toks
>
>
>> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
>> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL9959"
>> debug: URIDNSBL: query for svbrseprs.com took 3 seconds to look up
>> (sbl.spamhaus.org.:2.208.178.207)
>> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
>> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL21893"
>> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
>> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL13495"
>> debug: URIDNSBL: query for svbrseprs.com took 3 seconds to look up
>> (sbl.spamhaus.org.:2.199.36.69)
>
> none of that in the debug...
>
> i tried a few other undetected spam messages. same result. all of them
> have uris in them like:
> http://xgnuk.arms2nemesis.com/?TTlsSwFFf0pW6GC
> http://uyg.rxpharmagroup.com/track.asp?c=gi&cg=gi
> or have attachments....
>
>
> thanks...
>


-- 



SA List Subject/From Indicators

Posted by Rob McEwen <ro...@powerviewsystems.com>.
RE: SA List Messages Subject/From Indicators

Often, when I receive messages from the SA list, the FROM displays the name
of the sender rather than the name of the list. Furthermore, the SUBJECT
line is often an obvious SA-related phrase... BUT NOT ALWAYS. I find it
annoying sometimes when this happens... that is, a FROM that is not SA, and
a SUBJECT line that is also not easily associated with SA. Why? Because the
subject line can sometimes be something that indicates an emergency and I
then HAVE to read it ASAP to ensure that one of my own clients isn't having
an emergency regarding my own services.

One solution would be to unsubscribe and re-subscribe under a different
specially created e-mail address... but I do like keeping these SA messages
"close to home" so I prefer to not go that route.

Another possible solution would be to have the list server add "SA: " to the
beginning of each subject line (when not already there).

Any thoughts? Suggestions?

Rob McEwen


Re: bayes?!

Posted by kalin mintchev <ka...@el.net>.
> sa-learn --dbpath /var/spamdb/bayes --dump magic

i get this:

0.000          0          3          0  non-token data: bayes db version
0.000          0       2852          0  non-token data: nspam
0.000          0       2515          0  non-token data: nham
0.000          0     116330          0  non-token data: ntokens
0.000          0 1104894403          0  non-token data: oldest atime
0.000          0 1105570140          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1105571295          0  non-token data: last expiry atime
0.000          0     581418          0  non-token data: last expire atime delta
0.000          0      46098          0  non-token data: last expire reduction count
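
As a rough sketch, the counts in a dump like this can be pulled out programmatically and compared with the 200-ham/200-spam minimum mentioned elsewhere in this thread. The regular expression is an assumption based only on the column layout shown here, not on any documented format:

```python
import re

# Assumed layout: four numeric columns, then "non-token data: <name>".
MAGIC_LINE = re.compile(
    r"^\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$"
)

MIN_EACH = 200  # minimum ham and spam cited in this thread


def parse_dump_magic(text):
    """Map each 'non-token data' name to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


dump = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       2852          0  non-token data: nspam
0.000          0       2515          0  non-token data: nham
0.000          0     116330          0  non-token data: ntokens
"""

counts = parse_dump_magic(dump)
enough = counts["nspam"] >= MIN_EACH and counts["nham"] >= MIN_EACH
print(counts["nspam"], counts["nham"], enough)  # 2852 2515 True
```

By these numbers the database is well past the minimum, so too little training alone would not explain Bayes rules failing to hit.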

> what are the file sizes?  are the files writable/readable by the
> appropriate users?

-rw-r-----   1 root  vchkpw   688128 Jan 12 18:08 bayes_seen
-rw-r-----   1 root  vchkpw  2146304 Jan 12 18:08 bayes_toks


> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL9959"
> debug: URIDNSBL: query for svbrseprs.com took 3 seconds to look up
> (sbl.spamhaus.org.:2.208.178.207)
> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL21893"
> debug: URIDNSBL: domain "svbrseprs.com" listed (URIBL_SBL):
> "http://www.spamhaus.org/SBL/sbl.lasso?query=SBL13495"
> debug: URIDNSBL: query for svbrseprs.com took 3 seconds to look up
> (sbl.spamhaus.org.:2.199.36.69)

none of that in the debug...

i tried a few other undetected spam messages. same result. all of them
have uris in them like:
http://xgnuk.arms2nemesis.com/?TTlsSwFFf0pW6GC
http://uyg.rxpharmagroup.com/track.asp?c=gi&cg=gi
or have attachments....


thanks...



-- 


Re: bayes?!

Posted by kalin mintchev <ka...@el.net>.
thanks Jon..

>
> first, your Bayes rules don't appear to be hitting.

that's what i'm thinking also...

> this could be because you haven't trained enough mail.  you need minimum
> 200 ham and 200 spam before they kick in.

i did feed about 2500 messages into both spam and ham the first time, last
sunday. since then i've been feeding about 200 spams (only mine thus far)
a day..

> perl -MCPAN -e 'install Net::DNS'

got it...

> spamassassin -D --lint msg.txt

i run that plus the -C using my own user-confs


> look for lines concerning DNS, URIBL, or bayes.

here is what i get for bayes and URIBL:

debug: bayes: 64292 tie-ing to DB file R/O /var/spamdb/bayes_toks
debug: bayes: 64292 tie-ing to DB file R/O /var/spamdb/bayes_seen
debug: bayes: found bayes db version 3

the corpus line is missing?!?

and for the URIBL:

debug: is Net::DNS::Resolver available? yes
debug: Net::DNS version: 0.48
debug: trying (3) doubleclick.com...
debug: looking up NS for 'doubleclick.com'
debug: NS lookup of doubleclick.com succeeded => Dns available (set
dns_available to hardcode)
debug: is DNS available? 1
debug: decoding: no encoding detected
debug: URIDNSBL: domains to query:
debug: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x89552fc)
implements 'check_post_dnsbl'
debug: is spam? score=0 required=3

didn't see it as spam either...  the message is definitely spam:
Subject: Tylox Click on your dosage and quantity tract
and it's all html...

what should i do next?

thank you....


>
> example bayes lines that you want to see:
>
> debug: config: read file /usr/share/spamassassin/23_bayes.cf
> debug: bayes: 90539 tie-ing to DB file R/O
> /home/jsd/.spamassassin/bayes_toks
> debug: bayes: 90539 tie-ing to DB file R/O
> /home/jsd/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes corpus size: nspam = 9827, nham = 826
>
> note my corpus size - more than 200 each of spam and ham.
>
> example dns/uribl lines:
>
> debug: is Net::DNS::Resolver available? yes
> debug: Net::DNS version: 0.39
> debug: NS lookup of intel.com succeeded => Dns available (set
> dns_available to hardcode)
> debug: is DNS available? 1
> debug: URIDNSBL: domains to query:
>
>
>
> -jsd-
>


-- 




Re: bayes?!

Posted by Jon Drukman <js...@cluttered.com>.
kalin mintchev wrote:
> X-Spam-Status: Yes, score=4.6 required=3.0 tests=DATE_IN_FUTURE_12_24,
>       DRUGS_ERECTILE,DRUGS_PAIN,FORGED_HOTMAIL_RCVD,MIME_BASE64_TEXT
>       autolearn=no version=3.0.2
> 
> note that the ones that were detected scored only 4 - lower than the
> recommended default of 5....
> 
> i'd really appreciate any help to make sa detect at least 90% of incoming
> spam...

first, your Bayes rules don't appear to be hitting.  this could be 
because you haven't trained enough mail.  you need minimum 200 ham and 
200 spam before they kick in.

second, get the URIBL tests working.  they catch an amazing amount of 
spam.  you need a recent version of Net::DNS.  try this as root:

perl -MCPAN -e 'install Net::DNS'

i haven't seen a spam on my system in recent memory that didn't hit one 
or both of bayes + uribl.  in today's log file, out of 1068 identified 
spam messages, BAYES_99 hit 921 and URIBL_* hit 945.  those should be 
your front-line weapons.

save a known spam message to a file like "msg.txt" and then invoke 
spamassassin on it in debug mode like this:

spamassassin -D --lint msg.txt

look for lines concerning DNS, URIBL, or bayes.

example bayes lines that you want to see:

debug: config: read file /usr/share/spamassassin/23_bayes.cf
debug: bayes: 90539 tie-ing to DB file R/O 
/home/jsd/.spamassassin/bayes_toks
debug: bayes: 90539 tie-ing to DB file R/O 
/home/jsd/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: bayes corpus size: nspam = 9827, nham = 826

note my corpus size - more than 200 each of spam and ham.

example dns/uribl lines:

debug: is Net::DNS::Resolver available? yes
debug: Net::DNS version: 0.39
debug: NS lookup of intel.com succeeded => Dns available (set 
dns_available to hardcode)
debug: is DNS available? 1
debug: URIDNSBL: domains to query:
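
A small sketch for sifting a saved debug log for those lines. It assumes the debug output was captured to text beforehand; the keyword list and the corpus-size pattern are modelled only on the sample lines above:

```python
import re

KEYWORDS = ("bayes", "uridnsbl", "uribl", "dns")
CORPUS = re.compile(r"bayes corpus size: nspam = (\d+), nham = (\d+)")


def interesting_lines(debug_text):
    """Keep only the debug lines that mention DNS, URIBL or bayes."""
    return [
        line
        for line in debug_text.splitlines()
        if any(keyword in line.lower() for keyword in KEYWORDS)
    ]


def corpus_size(debug_text):
    """Return (nspam, nham) from the corpus line, or None if it is missing."""
    match = CORPUS.search(debug_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


sample = """\
debug: bayes: found bayes db version 3
debug: bayes corpus size: nspam = 9827, nham = 826
debug: is DNS available? 1
debug: decoding: no encoding detected
"""

print(len(interesting_lines(sample)))  # 3
print(corpus_size(sample))             # (9827, 826)
```

A None result from corpus_size means the corpus line never appeared in the log.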



-jsd-


Re: Re[2]: bayes?!

Posted by kalin mintchev <ka...@el.net>.
thanks Robert...

> And don't worry about the ratio -- I feed Bayes spam/ham in a 10:1
> ratio, and it's working wonderfully.

ok.. unfortunately i have to report that for me there isn't much
difference. overnight i got 88 messages in my mailbox. 72 of them were
spam - not detected by sa. in the same period of time there were about 100
- 110 messages correctly tagged as spam. so it's more like 60% of the spam
gets stopped and the rest goes through undetected.
i guess it's still not working correctly....
when i first fed the spam and ham i used messages that were sent directly
to my account. i was expecting a much higher rate of spam detection.
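
The arithmetic behind that estimate, as a quick sketch using the counts above (72 missed, roughly 100 to 110 caught):

```python
def detection_rate(caught, missed):
    """Fraction of all spam that was tagged."""
    return caught / (caught + missed)


missed = 72
for caught in (100, 110):
    print(f"caught={caught}: {detection_rate(caught, missed):.1%}")
# caught=100: 58.1%
# caught=110: 60.4%
```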

i'm experimenting with a vpopmail set up. here are the db permissions:
-rw-r-----   1 root  vchkpw   688128 Jan 11 16:01 bayes_seen
-rw-r-----   1 root  vchkpw  5439488 Jan 11 16:01 bayes_toks
where vchkpw is the vpopmail group.
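
To spell out what that mode string grants, here is a minimal sketch that decodes an ls -l style mode. Whether the scanning process needs write access to these files depends on the setup, which this thread does not settle:

```python
def parse_mode(mode):
    """Split an ls -l mode string such as '-rw-r-----' into per-class flags."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string")
    result = {}
    for index, name in enumerate(("owner", "group", "other")):
        bits = mode[1 + 3 * index : 4 + 3 * index]
        result[name] = {"read": bits[0] == "r", "write": bits[1] == "w"}
    return result


perms = parse_mode("-rw-r-----")
print(perms["owner"])  # {'read': True, 'write': True}
print(perms["group"])  # {'read': True, 'write': False}
print(perms["other"])  # {'read': False, 'write': False}
```

So root can read and write, members of vchkpw can only read, and everyone else has no access.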

maybe i'm still doing something wrong...

here is my own user-prefs and below i have examples of X headers of
detected spam and spam that got through:

required_hits   3.00
rewrite_header Subject  [SPAM]
bayes_path      /var/spamdb/bayes
defang_mime     0
use_terse_report        1

(i realize defang_mime and use_terse_report are old directives but i
figured they won't hurt the process - do they?!)


score   ADVERT_CODE     2.00
score   BILL_1618       3.00
score   DATE_IN_PAST_03_06      3.00
score   FORGED_YAHOO_RCVD       3.00
score   INCREASE_SALES  4.00
score   MIME_EXCESSIVE_QP       3.00
score   MONEY_BACK      1.00
score   MORTGAGE_RATES  5.00
score   MSGID_CHARS_WEIRD       3.00
score   MSG_ID_ADDED_BY_MTA     3.00
score   MSG_ID_ADDED_BY_MTA_2   3.00
score   OFFER   3.00
score   ORDER_NOW       3.00
score   PENIS_ENLARGE   5.00
score   PORN_12 5.00
score   PORN_4  3.00
score   SAVE_MONEY      3.00
score   SMTPD_IN_RCVD   3.00
score   SUBJ_MISSING    3.00
score   VIAGRA  5.00
score   WANTS_CREDIT_CARD       3.00


also there is a bunch of white and black listed domains..


headers of spam that GOT THROUGH:

X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on chavo.el.net
X-Spam-Level: **
X-Spam-Status: No, score=2.8 required=3.0 tests=HELO_DYNAMIC_DHCP,
      HELO_DYNAMIC_IPADDR autolearn=no version=3.0.2


X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on chavo.el.net
X-Spam-Level: **
X-Spam-Status: No, score=2.7 required=3.0 tests=DRUGS_ERECTILE,
      DRUGS_ERECTILE_OBFU,HELO_DYNAMIC_DIALIN,RCVD_ILLEGAL_IP autolearn=no
      version=3.0.2


this is DETECTED SPAM:

X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on chavo.el.net
X-Spam-Level: ****
X-Spam-Status: Yes, score=4.0 required=3.0 tests=HELO_DYNAMIC_COMCAST,
      INFO_TLD autolearn=no version=3.0.2
MIME-Version: 1.0


X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on chavo.el.net
X-Spam-Level: ****
X-Spam-Status: Yes, score=4.6 required=3.0 tests=DATE_IN_FUTURE_12_24,
      DRUGS_ERECTILE,DRUGS_PAIN,FORGED_HOTMAIL_RCVD,MIME_BASE64_TEXT
      autolearn=no version=3.0.2

note that the ones that were detected scored only 4 - lower than the
recommended default of 5....
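
A rough sketch for pulling the score and rule names out of an X-Spam-Status value like the ones above. The pattern is an assumption based only on these sample headers:

```python
import re

# Assumed layout, taken from the headers shown above.
STATUS = re.compile(
    r"^(Yes|No), score=(-?[\d.]+) required=(-?[\d.]+) tests=(.*?) autolearn=(\S+)"
)


def parse_status(value):
    """Parse a (possibly folded) X-Spam-Status header value."""
    flat = " ".join(value.split())  # unfold continuation lines
    match = STATUS.match(flat)
    if not match:
        raise ValueError("unrecognised X-Spam-Status value")
    tests = [name.strip() for name in match.group(4).split(",") if name.strip()]
    return {
        "flagged": match.group(1) == "Yes",
        "score": float(match.group(2)),
        "required": float(match.group(3)),
        "tests": tests,
    }


header = """Yes, score=4.6 required=3.0 tests=DATE_IN_FUTURE_12_24,
      DRUGS_ERECTILE,DRUGS_PAIN,FORGED_HOTMAIL_RCVD,MIME_BASE64_TEXT
      autolearn=no version=3.0.2"""

status = parse_status(header)
print(status["score"], status["required"], len(status["tests"]))  # 4.6 3.0 5
print(status["score"] >= 5.0)  # False
```

The last line makes the point numerically: this message is tagged only because required is set to 3.0, and no BAYES_* or URIBL_* rule appears among its tests.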

i'd really appreciate any help to make sa detect at least 90% of incoming
spam...

thank you...