Posted to users@spamassassin.apache.org by Jim Maul <jm...@elih.org> on 2004/08/06 20:03:27 UTC
autolearn=ham when it shouldnt
For some reason, i have a message that SA insists on autolearning as ham when
its score is clearly above the ham autolearn threshold.
I even put
bayes_auto_learn_threshold_nonspam 0.1
in my local.cf just in case.
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on external.elih.org
X-Spam-Level: *
X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
1.1 is higher than 0.1 so why the autolearn?
Checking the debug, i notice this:
debug: auto-learn? ham=0.1, spam=10, body-hits=-1.465, head-hits=-4.3
debug: auto-learn: currently using scoreset 3. recomputing score based on
scoreset 1.
debug: Score set 1 chosen.
debug: auto-learn: original score: 1.134, recomputed score: -0.655
debug: Score set 3 chosen.
debug: auto-learn? yes, ham (-0.655 < 0.1)
debug: Learning Ham
Where is it getting this -0.655? the score shown above is 1.1?
Any ideas?
Jim
Re: autolearn=ham when it shouldnt
Posted by Daniel Quinlan <qu...@pathname.com>.
Jim Maul <jm...@elih.org> writes:
> I dont have any trusted networks configured. Never even bothered with
> it. The sender was fool.com.
>
> Does anyone know if they are really trusted?
They are a Bonded Sender, see http://www.bondedsender.org/ for the terms
of the program. I assume you signed up with the Motley Fool at some
point.
> It seems weird to me that they would be trusted and also send out
> messages that hit INVALID_MSGID.
It could be an overaggressive rule, the score is not that high, after
all. Or, their software could be somewhat broken, that's not exactly
unusual.
> I guess anything is possible however.
>
> So now i guess my question is. How can i prevent this type of thing
> in the future? I dont want a message that hits a mess of other
> positive rules to be autolearned. Im afraid ham messages of this
> spammy nature are going to influence my bayes database in a negative
> way.
It's working as designed and you WANT to learn on this sort of ham. If
you only learned on completely RFC-2822 conforming and plain-looking
ham, then Bayes wouldn't work so well.
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/
Re: autolearn=ham when it shouldnt
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:32 PM 8/6/2004, Jim Maul wrote:
>I dont have any trusted networks configured. Never even bothered with
>it. The
>sender was fool.com.
>
>Does anyone know if they are really trusted?
IMO, fool.com is VERY trustworthy. They are a financial advice site, have
been running since 1993, and are quite tech-savvy and spam-aware. They
are a very reputable company and have dozens of books in print, available
for sale everywhere.
http://www.amazon.com/exec/obidos/tg/detail/-/0743229991/qid=1091817599/sr=8-2/ref=sr_8_xs_ap_i2_xgl14/103-1445742-0704632?v=glance&s=books&n=507846
Here's a bio on the creators:
http://www.npr.org/about/people/bios/dtgardner.html
All their mailing lists are also confirmed double-opt-in lists. You have to
click a URL that's in an email they send you to remain on their mailing lists.
Sounds like someone signed up for a newsletter at fool.com and forgot they
did.
Re: autolearn=ham when it shouldnt
Posted by Matt Kettler <mk...@evi-inc.com>.
At 11:27 AM 8/10/2004, Jim Maul wrote:
>X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
> HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
> RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
>
>
>What i was concerned with was things like click below, dear something, html
>messages, web bugs, etc getting into bayes as ham. Im not saying the message
>isnt ham, i just dont want bayes getting confused.
Since you're still apprehensive, let me fill in some extra details here.
Please don't take my wording as an intent to insult, my intent is to bring
across a point in a clear, if a bit blunt, manner.
1) none of those are very good spam signs, despite what you might
think from their names.
2) bayes doesn't learn SA rules, so which rules hit won't affect
bayes per se
3) bayes won't even see the HTML code of a message. That's all
stripped before bayes examines the message.
4) again, it's all about being realistic, not about what you perceive as
"possibly spam-like nonspam".
To justify my statement in 1) let's look at the STATISTICS.txt data for
those rules, focusing on the simple metric of S/O.
S/O is the ratio of spam hits to overall hits for a given rule. It
literally represents what percentage of a rule's hits are spam in the
corpus tests. A rule with an S/O of 0.90 has 90% of its hits being spam,
and 10% being nonspam.
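
As a rough illustration of the S/O definition above, here is a minimal Python sketch. The function name and the hit counts are made up for the example; they are not taken from STATISTICS.txt or from SpamAssassin itself.

```python
def spam_over_overall(spam_hits: int, ham_hits: int) -> float:
    """Return S/O: the fraction of a rule's hits that landed on spam."""
    total = spam_hits + ham_hits
    if total == 0:
        raise ValueError("rule has no hits, S/O is undefined")
    return spam_hits / total


# Hypothetical counts: a rule that hit 90 spam and 10 nonspam messages.
so = spam_over_overall(90, 10)
print(f"S/O = {so:.2f}, nonspam share of hits = {1 - so:.0%}")
```

The nonspam share of a rule's hits is simply 1 minus its S/O, which is the figure used when comparing the rules below.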
STATISTICS.txt is a table of results for the mass-check corpus test that
was used to generate the scores for a particular version of SA. 2.63's
corpus run consisted of 543,473 messages, a fairly decent sized statistical
sampling of email. It's not a perfect statistical sample, but it's
certainly not grossly undersized.
Here are the results from 2.63's STATISTICS.txt (I've trimmed off the first
few columns so we can look at S/O easily)
S/O RANK SCORE NAME
0.902 0.75 0.10 CLICK_BELOW
0.880 0.65 1.61 DEAR_SOMETHING
0.953 0.87 0.10 HTML_LINK_CLICK_HERE
0.896 0.81 0.16 HTML_MESSAGE
0.964 0.86 1.12 HTML_WEB_BUGS
0.957 0.83 1.17 INVALID_MSGID
Not bad, but none are altogether impressive. Two of these have more than
10% of their hits being nonspam, and the best of the lot has 3.6%. The
worst has 12%.
Now here are the results from the recent SA 3.0-pre4's STATISTICS.txt:
S/O RANK SCORE NAME
0.687 0.29 0.01 CLICK_BELOW
0.857 0.40 1.23 DEAR_SOMETHING
0.832 0.30 0.01 HTML_LINK_CLICK_HERE
0.908 0.34 0.01 HTML_MESSAGE
0.896 0.39 0.37 HTML_WEB_BUGS
0.890 0.43 1.08 INVALID_MSGID
Ouch. It would appear that given current trends in email, all of these
rules are very poor performers indeed! 5 of the 6 have more than 10% of
their hits being nonspam, and the other is not far behind. The best of the
lot has 9.2% of its hits being nonspam messages. The worst of them has 31%
of its hits being nonspam!
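
The percentages quoted for the two tables can be re-derived with a short Python sketch. The S/O values are copied from the two excerpts above; the dictionary and function names are just illustrative.

```python
# S/O values copied from the two STATISTICS.txt excerpts in this message.
SO_2_63 = {
    "CLICK_BELOW": 0.902,
    "DEAR_SOMETHING": 0.880,
    "HTML_LINK_CLICK_HERE": 0.953,
    "HTML_MESSAGE": 0.896,
    "HTML_WEB_BUGS": 0.964,
    "INVALID_MSGID": 0.957,
}
SO_3_0_PRE4 = {
    "CLICK_BELOW": 0.687,
    "DEAR_SOMETHING": 0.857,
    "HTML_LINK_CLICK_HERE": 0.832,
    "HTML_MESSAGE": 0.908,
    "HTML_WEB_BUGS": 0.896,
    "INVALID_MSGID": 0.890,
}


def nonspam_share(so: float) -> float:
    """Fraction of a rule's hits that were nonspam (1 - S/O)."""
    return 1.0 - so


def rules_over(table: dict, limit: float = 0.10) -> list:
    """Names of rules whose nonspam share of hits exceeds `limit`."""
    return sorted(n for n, so in table.items() if nonspam_share(so) > limit)


for label, table in (("2.63", SO_2_63), ("3.0-pre4", SO_3_0_PRE4)):
    print(f"SA {label}:")
    for name, so in table.items():
        print(f"  {name:<22} S/O={so:.3f}  nonspam={nonspam_share(so):.1%}")
    print(f"  rules over 10% nonspam: {len(rules_over(table))}")
```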
Clearly your message is not statistically that "out of line" for a nonspam
message. Clearly all of these rules have high enough false positive rates
that it's not unexpected for nonspam messages to hit them.
In general, I'd still emphasize that you need to worry less about
poisoning bayes by feeding it all of your mail, and more about
poisoning it with your preconceptions of what it needs to
see. Bayes really is designed for the real world; you don't need to isolate
it from the facts of reality.
Bayes is a very broad statistics-based tool that tokenizes nearly every
word in an email it learns. Little disturbances like this might bother you,
but they'll hardly influence bayes at all. Bayes naturally accommodates
"neutral" tokens which appear in both spam and nonspam. A word like
"kitten" could appear in a child's email, or a porn spam. Bayes as a result
will learn this token is neutral. On the other hand, "teenxxx" is not
likely to appear in anything but spam, and bayes will learn to recognize
that as important. Bayes treats each and every token it finds as a
statistic with a percentage chance of spam, then looks at the overall
collection of them when making its decisions. It doesn't look at just one
or two things in an email; it usually looks at more like 15-30.
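
To make the "collection of tokens" idea concrete, here is a minimal Python sketch of a textbook naive-Bayes combination of per-token spam probabilities. This is an assumption-laden illustration, not SpamAssassin's actual combining code (its own combiner is more elaborate), and the token probabilities are invented.

```python
import math


def combine(token_probs):
    """Combine per-token spam probabilities into one message-level value.

    Textbook naive-Bayes combination computed in log space:
        P = prod(p) / (prod(p) + prod(1 - p))
    This is a simplification for illustration only.
    """
    eps = 1e-9  # keep log() away from 0 and 1
    log_spam = 0.0
    log_ham = 0.0
    for p in token_probs:
        p = min(max(p, eps), 1.0 - eps)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the result is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical values: a neutral token contributes nothing either way,
# while a strongly spammy token dominates the result.
print(combine([0.5, 0.5]))    # two neutral tokens
print(combine([0.5, 0.99]))   # one neutral, one strongly spammy
```

A neutral token (probability 0.5) leaves the combined value unchanged, which is why a handful of "spammy-looking" words in ham has little effect once they have been seen in both kinds of mail.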
Sometime, run a message through spamassassin -D; you'll get a better
understanding of just how much it looks at. Look for some lines like these;
there should be several:
debug: bayes token 'FEATURED' => 0.00273649317207415
debug: bayes token 'pdf' => 0.00511688439191974
debug: bayes token 'UD:pdf' => 0.00539832400516067
debug: bayes token 'ranges' => 0.00659281885375479
debug: bayes token 'RED' => 0.00664197530864198
(the message I took this from had 152 tokens it matched. It's 36kbytes, so
it's a bit long, but you should get the idea that bayes examines lots of
things, not just a few. Another message, 861 bytes, including headers, with
only 1 line of body text, hit 8 tokens)
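
If you want to count and sort those token lines rather than read them by eye, a small Python sketch like the following would do it. It assumes the debug output keeps exactly the "debug: bayes token '...' => ..." shape shown above (other SA versions may format it differently); the function name is made up.

```python
import re

# Matches the debug line shape quoted above; the format is assumed, not guaranteed.
TOKEN_LINE = re.compile(
    r"^debug: bayes token '(.*)' => ([0-9.]+(?:[eE][-+]?[0-9]+)?)\s*$"
)


def parse_bayes_tokens(debug_text):
    """Return a list of (token, probability) pairs found in debug output."""
    found = []
    for line in debug_text.splitlines():
        match = TOKEN_LINE.match(line.strip())
        if match:
            found.append((match.group(1), float(match.group(2))))
    return found


sample = """\
debug: bayes token 'FEATURED' => 0.00273649317207415
debug: bayes token 'pdf' => 0.00511688439191974
debug: bayes token 'UD:pdf' => 0.00539832400516067
debug: bayes token 'ranges' => 0.00659281885375479
debug: bayes token 'RED' => 0.00664197530864198
"""

tokens = parse_bayes_tokens(sample)
print(f"{len(tokens)} tokens matched")
for name, prob in sorted(tokens, key=lambda pair: pair[1]):
    print(f"  {name:<10} {prob:.5f}")
```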
Re: autolearn=ham when it shouldnt
Posted by Jim Maul <jm...@elih.org>.
Quoting Matt Kettler <mk...@evi-inc.com>:
> At 02:32 PM 8/6/2004, Jim Maul wrote:
>> So now i guess my question is. How can i prevent this type of thing in the
>> future? I dont want a message that hits a mess of other positive
>> rules to be
>> autolearned. Im afraid ham messages of this spammy nature are going to
>> influence my bayes database in a negative way.
>
> What's so spammy about the message that bayes will notice? clearly the
> INVALID_MSGID is irrelevant here, and that's the only thing you can point
> out that's even remotely "spammy" about the message.
>
Well defining "spammy" is difficult to do. Honestly, i was basing "spammy"
solely on my perception of the message. Looks kinda "spammy" to me. However,
there are other rules that i did not mention, INVALID_MSGID was not the only
one.
X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
What i was concerned with was things like click below, dear something, html
messages, web bugs, etc getting into bayes as ham. Im not saying the message
isnt ham, i just dont want bayes getting confused.
> Quite frankly, I use fool's mailing lists as part of my forced-training on
> ham. Every day I pump their messages into my ham training as a part of an
> automated cron job. I do the same for CNN and several other news sites that
> my users use.
>
> (that said, I've not seen any of the fool messages trigger INVALID_MSGID on
> my system)
>
> I think you're being massively over-paranoid. Bayes isn't poisoned so
> easily, and quite frankly, the message IS nonspam and IS very typical of
> real-world html newsletters sent out by legitimate companies all the time.
>
> If SA's bayes engine would be so radically upset by real-world email, it
> wouldn't work at all well. It's actually important to feed your bayes DB
> "spammy ham", otherwise your bayes database is not going to work very well
> when it comes to distinguishing such subtleties.
So i guess i'll shut up and let it do its thing :)
Thanks,
Jim
Re: autolearn=ham when it shouldnt
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:32 PM 8/6/2004, Jim Maul wrote:
>So now i guess my question is. How can i prevent this type of thing in the
>future? I dont want a message that hits a mess of other positive rules to be
>autolearned. Im afraid ham messages of this spammy nature are going to
>influence my bayes database in a negative way.
What's so spammy about the message that bayes will notice? clearly the
INVALID_MSGID is irrelevant here, and that's the only thing you can point
out that's even remotely "spammy" about the message.
Quite frankly, I use fool's mailing lists as part of my forced-training on
ham. Every day I pump their messages into my ham training as a part of an
automated cron job. I do the same for CNN and several other news sites that
my users use.
(that said, I've not seen any of the fool messages trigger INVALID_MSGID on
my system)
I think you're being massively over-paranoid. Bayes isn't poisoned so
easily, and quite frankly, the message IS nonspam and IS very typical of
real-world html newsletters sent out by legitimate companies all the time.
If SA's bayes engine were so radically upset by real-world email, it
wouldn't work well at all. It's actually important to feed your bayes DB
"spammy ham", otherwise your bayes database is not going to work very well
when it comes to distinguishing such subtleties.
It's like training a security guard to recognize criminals by only showing
them pictures of suit wearing accountants and prison inmates. What are they
going to do when they see a construction worker? Looks more like an inmate
than an accountant, must be a criminal. Come to think of it, which do YOU
look more like?
Don't isolate your bayes database from reality by giving it false
impressions of spam and ham. This does much more harm than good. Teach it
everything you can about ham and spam, and keep it realistic. The bayes DB
will be able to make much better decisions if you do.
Re: autolearn=ham when it shouldnt
Posted by Jim Maul <jm...@elih.org>.
Quoting Matt Kettler <mk...@evi-inc.com>:
> At 02:03 PM 8/6/2004, Jim Maul wrote:
>
>> Where is it getting this -0.655? the score shown above is 1.1?
>
> It's recomputing the score of the message as if bayes were disabled, which
> is why it jumped to score set 1, and then back to 3.
>
> This behavior is intentional to prevent bayes rules from self feeding, and
> is in the documentation.
>
> Probably the biggest situation you've got on your hands is why did the
> message hit RCVD_IN_BSP_TRUSTED. Was it *really* sent by a bondedsender
> listed server, or is your trusted_networks misconfigured?
I dont have any trusted networks configured. Never even bothered with
it. The
sender was fool.com.
Does anyone know if they are really trusted? It seems weird to me that they
would be trusted and also send out messages that hit INVALID_MSGID.
I guess anything is possible however.
So now i guess my question is. How can i prevent this type of thing in the
future? I dont want a message that hits a mess of other positive rules to be
autolearned. Im afraid ham messages of this spammy nature are going to
influence my bayes database in a negative way.
Thanks,
Jim
Re: autolearn=ham when it shouldnt
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:03 PM 8/6/2004, Jim Maul wrote:
>Where is it getting this -0.655? the score shown above is 1.1?
It's recomputing the score of the message as if bayes were disabled, which
is why it jumped to score set 1, and then back to 3.
This behavior is intentional to prevent bayes rules from self feeding, and
is in the documentation.
Probably the bigger question on your hands is why the
message hit RCVD_IN_BSP_TRUSTED. Was it *really* sent by a bondedsender
listed server, or is your trusted_networks misconfigured?
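
The threshold comparison from the debug output can be sketched in a few lines of Python. This is a simplified illustration, not SpamAssassin's code: the function name is hypothetical, the real check applies further conditions (the body-hits and head-hits figures in the debug line, for instance), and the exact comparison used on the spam side is an assumption.

```python
def autolearn_decision(score_without_bayes, ham_threshold=0.1, spam_threshold=10.0):
    """Decide whether a message would be auto-learned, and as what.

    `score_without_bayes` is the score recomputed with a non-Bayes score set,
    not the score shown in X-Spam-Status. The defaults mirror the
    "ham=0.1, spam=10" values in the debug output quoted in this thread.
    """
    if score_without_bayes < ham_threshold:
        return "ham"
    if score_without_bayes >= spam_threshold:  # assumed comparison, see lead-in
        return "spam"
    return "no"


# Values from the debug output: original score 1.134, recomputed score -0.655.
print(autolearn_decision(-0.655))  # the recomputed score is what gets compared
print(autolearn_decision(1.134))   # the header score alone would not have learned
```

This is why a message showing hits=1.1 in its headers can still be learned as ham with a 0.1 threshold: the comparison uses the recomputed -0.655, not the 1.1.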
RE: autolearn=ham when it shouldnt
Posted by Jim Maul <jm...@elih.org>.
Quoting Bret Miller <br...@wcg.org>:
>
> Maybe AWL?
>
> Bret
I dont think so. I dont have any whitelisting enabled...unless auto whitelist
is enabled by default?
Jim
RE: autolearn=ham when it shouldnt
Posted by Bret Miller <br...@wcg.org>.
> -----Original Message-----
> From: Jim Maul [mailto:jmaul@elih.org]
> Sent: Friday, August 06, 2004 11:03 AM
> To: spamassassin-users@incubator.apache.org
> Subject: autolearn=ham when it shouldnt
>
>
> For some reason, i have a message that SA insists on
> autolearning as ham when
> its score is clearly above the ham autolearn threshold.
>
> I even put
>
> bayes_auto_learn_threshold_nonspam 0.1
>
> in my local.cf just in case.
>
> X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on
> external.elih.org
> X-Spam-Level: *
> X-Spam-Status: No, hits=1.1 required=5.0
> tests=CLICK_BELOW,DEAR_SOMETHING,
> HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
> RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
>
> 1.1 is higher than 0.1 so why the autolearn?
> Checking the debug, i notice this:
>
> debug: auto-learn? ham=0.1, spam=10, body-hits=-1.465, head-hits=-4.3
> debug: auto-learn: currently using scoreset 3. recomputing
> score based on
> scoreset 1.
> debug: Score set 1 chosen.
> debug: auto-learn: original score: 1.134, recomputed score: -0.655
> debug: Score set 3 chosen.
> debug: auto-learn? yes, ham (-0.655 < 0.1)
> debug: Learning Ham
>
> Where is it getting this -0.655? the score shown above is 1.1?
>
> Any ideas?
Maybe AWL?
Bret
----------
Send your spam to: bretmiller@wcg.org
Thanks for keeping the internet spam-free!