Posted to users@spamassassin.apache.org by Jim Maul <jm...@elih.org> on 2004/08/06 20:03:27 UTC

autolearn=ham when it shouldnt

For some reason, I have a message that SA insists on autolearning as ham even
though its score is clearly above the ham autolearn threshold.

I even put

bayes_auto_learn_threshold_nonspam 0.1

in my local.cf just in case.
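(For reference, the full set of autolearn knobs in local.cf looks roughly
like this; the 10-point spam threshold is just what the debug output below
reports as being in effect, and the option names are as I understand them
for 2.6x:)

   bayes_auto_learn 1
   bayes_auto_learn_threshold_nonspam 0.1
   bayes_auto_learn_threshold_spam 10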

X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on external.elih.org
X-Spam-Level: *
X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
        HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
        RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63

1.1 is higher than 0.1, so why the autolearn?

Checking the debug output, I notice this:

debug: auto-learn? ham=0.1, spam=10, body-hits=-1.465, head-hits=-4.3
debug: auto-learn: currently using scoreset 3.  recomputing score based on
scoreset 1.
debug: Score set 1 chosen.
debug: auto-learn: original score: 1.134, recomputed score: -0.655
debug: Score set 3 chosen.
debug: auto-learn? yes, ham (-0.655 < 0.1)
debug: Learning Ham

Where is it getting this -0.655?  The score shown above is 1.1.

Any ideas?

Jim

Re: autolearn=ham when it shouldnt

Posted by Daniel Quinlan <qu...@pathname.com>.
Jim Maul <jm...@elih.org> writes:

> I dont have any trusted networks configured.  Never even bothered with
> it.  The sender was fool.com.
> 
> Does anyone know if they are really trusted?

They are a Bonded Sender; see http://www.bondedsender.org/ for the terms
of the program.  I assume you signed up with the Motley Fool at some
point.

> It seems weird to me that they would be trusted and also send out
> messages that hit INVALID_MSGID.

It could be an overaggressive rule; the score is not that high, after
all.  Or their software could be somewhat broken, which is not exactly
unusual.
 
> I guess anything is possible however.
> 
> So now i guess my question is.  How can i prevent this type of thing
> in the future?  I dont want a message that hits a mess of other
> positive rules to be autolearned.  Im afraid ham messages of this
> spammy nature are going to influence my bayes database in a negative
> way.

It's working as designed, and you WANT to learn on this sort of ham.  If
you only learned on completely RFC-2822-conforming, plain-looking ham,
then Bayes wouldn't work so well.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: autolearn=ham when it shouldnt

Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:32 PM 8/6/2004, Jim Maul wrote:
>I dont have any trusted networks configured.  Never even bothered with it.
>The sender was fool.com.
>
>Does anyone know if they are really trusted?

IMO, fool.com is VERY trustworthy. They are a financial advice site, have
been running since 1993, and are quite tech-savvy and spam-aware. They
are a very reputable company and have dozens of books in print available
for sale everywhere.

http://www.amazon.com/exec/obidos/tg/detail/-/0743229991/qid=1091817599/sr=8-2/ref=sr_8_xs_ap_i2_xgl14/103-1445742-0704632?v=glance&s=books&n=507846

Here's a bio on the creators:
http://www.npr.org/about/people/bios/dtgardner.html


All their mailing lists are also confirmed double-opt-in lists. You have to 
click a URL that's in an email they send you to remain on their mailing lists.

Sounds like someone signed up for a newsletter at fool.com and forgot they 
did.



Re: autolearn=ham when it shouldnt

Posted by Matt Kettler <mk...@evi-inc.com>.
At 11:27 AM 8/10/2004, Jim Maul wrote:
>X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
>          HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
>          RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
>
>
>What i was concerned with was things like click below, dear something, html
>messages, web bugs, etc getting into bayes as ham.  Im not saying the message
>isnt ham, i just dont want bayes getting confused.

Since you're still apprehensive, let me fill in some extra details here.
Please don't take my wording as an intent to insult; my intent is to make
the point clearly, if a bit bluntly.

1) None of those are very good spam signs, despite what you might think
   from their names.
2) Bayes doesn't learn SA rules, so which rules hit won't affect bayes
   per se.
3) Bayes won't even see the HTML code of a message; that's all stripped
   before bayes examines the message.
4) Again, it's all about being realistic, not what you perceive as
   "possibly spam-like nonspam".


To justify my statement in 1), let's look at the STATISTICS.txt data for
those rules, focusing on the simple metric of S/O.

S/O is the ratio of spam hits to overall hits for a given rule. It
literally represents what percentage of a rule's hits are spam in the
corpus tests. A rule with an S/O of 0.90 has 90% of its hits being spam,
and 10% being nonspam.
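As a quick worked example of that ratio (the counts here are invented,
purely to illustrate the arithmetic):

   S/O = spam hits / (spam hits + nonspam hits)
       = 900 / (900 + 100)
       = 0.90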

STATISTICS.txt is a table of results for the mass-check corpus test that
was used to generate the scores for a particular version of SA. 2.63's
corpus run consisted of 543,473 messages, a fairly decent-sized
statistical sampling of email. It's not a perfect statistical sample, but
it's certainly not grossly undersized.


Here are the results from 2.63's STATISTICS.txt (I've trimmed off the
first few columns so we can look at S/O easily):
  S/O    RANK   SCORE  NAME
  0.902   0.75    0.10  CLICK_BELOW
  0.880   0.65    1.61  DEAR_SOMETHING
  0.953   0.87    0.10  HTML_LINK_CLICK_HERE
  0.896   0.81    0.16  HTML_MESSAGE
  0.964   0.86    1.12  HTML_WEB_BUGS
  0.957   0.83    1.17  INVALID_MSGID

Not bad, but none are altogether impressive. Two of these have more than 
10% of their hits being nonspam, and the best of the lot has 3.6%. The 
worst has 12%.

Now here are the results from the recent SA 3.0-pre4's STATISTICS.txt:

  S/O    RANK   SCORE  NAME
  0.687   0.29    0.01  CLICK_BELOW
  0.857   0.40    1.23  DEAR_SOMETHING
  0.832   0.30    0.01  HTML_LINK_CLICK_HERE
  0.908   0.34    0.01  HTML_MESSAGE
  0.896   0.39    0.37  HTML_WEB_BUGS
  0.890   0.43    1.08  INVALID_MSGID

Ouch. It would appear that, given current trends in email, all of these
rules are very poor performers indeed! Five of the six are over 10%, and
the other is not far behind. The best of the lot has 9.2% of its hits
being nonspam messages; the worst has roughly 31%!

Clearly your message is not statistically that "out of line" for a
nonspam message; all of these rules hit nonspam often enough that it's
not unexpected for nonspam messages to hit them.
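(And if one of these rules really bothers you, its score can always be
re-weighted locally in local.cf with the score directive; the value below
is purely illustrative, not a recommendation:)

   score HTML_WEB_BUGS 0.5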

In general, I'd still emphasize that you should worry less about
poisoning bayes by feeding it all of your mail, and more about poisoning
it with your preconceptions of what it needs to see. Bayes really is
designed for the real world; you don't need to isolate it from the facts
of reality.

Bayes is a very broad statistics-based tool that tokenizes nearly every 
word in an email it learns. Little disturbances like this might bother you, 
but they'll hardly influence bayes at all.  Bayes naturally accommodates 
"neutral" tokens which appear in both spam and nonspam. A word like 
"kitten" could appear in a child's email, or a porn spam. Bayes as a result 
will learn this token is neutral. On the other hand, "teenxxx" is not 
likely to appear in anything but spam, and bayes will learn to recognize 
that as important. Bayes treats each and every token it finds as a 
statistic with a percentage chance of spam, then looks at the overall 
collection of them when making it's decisions. It doesn't look at just one 
or two things in an email, it usually looks at more like 15-30.

Sometime, run a message through spamassassin -D; you'll get a better
understanding of just how much it looks at. Look for lines like these
(there should be several):


debug: bayes token 'FEATURED' => 0.00273649317207415
debug: bayes token 'pdf' => 0.00511688439191974
debug: bayes token 'UD:pdf' => 0.00539832400516067
debug: bayes token 'ranges' => 0.00659281885375479
debug: bayes token 'RED' => 0.00664197530864198

(The message I took this from matched 152 tokens. It's 36 KB, so it's a
bit long, but you should get the idea that bayes examines lots of things,
not just a few. Another message, 861 bytes including headers, with only
one line of body text, hit 8 tokens.)
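(If you want to try this yourself, one rough way to pull out just the
token lines is something along these lines -- debug output goes to
stderr, and the message file name is of course just a placeholder:)

   spamassassin -D < some-message.txt 2>&1 | grep "bayes token"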




Re: autolearn=ham when it shouldnt

Posted by Jim Maul <jm...@elih.org>.
Quoting Matt Kettler <mk...@evi-inc.com>:

> At 02:32 PM 8/6/2004, Jim Maul wrote:
>> So now i guess my question is.  How can i prevent this type of thing in the
>> future?  I dont want a message that hits a mess of other positive 
>> rules to be
>> autolearned.  Im afraid ham messages of this spammy nature are going to
>> influence my bayes database in a negative way.
>
> What's so spammy about the message that bayes will notice? clearly the
> INVALID_MSGID is irrelevant here, and that's the only thing you can point
> out that's even remotely "spammy" about the message.
>

Well defining "spammy" is difficult to do.  Honestly, i was basing "spammy"
solely on my perception of the message.  Looks kinda "spammy" to me.  However,
there are other rules that i did not mention, INVALID_MSGID was not the only
one.

X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
         HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
         RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63


What I was concerned with was things like click below, dear something,
HTML messages, web bugs, etc. getting into bayes as ham.  I'm not saying
the message isn't ham; I just don't want bayes getting confused.

> Quite frankly, I use fool's mailing lists as part of my forced-training on
> ham. Every day I pump their messages into my ham training as a part of an
> automated cron job. I do the same for CNN and several other news sites that
> my users use.
>
> (that said, I've not seen any of the fool messages trigger INVALID_MSGID on
> my system)
>
> I think you're being massively over-paranoid. Bayes isn't poisoned so
> easily, and quite frankly, the message IS nonspam and IS very typical of
> real-world html newsletters sent out by legitimate companies all the time.
>
> If SA's bayes engine would be so radically upset by real-world email, it
> wouldn't work at all well. It's actually important to feed your bayes DB
> "spammy ham", otherwise your bayes database is not going to work very well
> when it comes to distinguishing such subtleties.


So I guess I'll shut up and let it do its thing :)

Thanks,

Jim

Re: autolearn=ham when it shouldnt

Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:32 PM 8/6/2004, Jim Maul wrote:
>So now i guess my question is.  How can i prevent this type of thing in the
>future?  I dont want a message that hits a mess of other positive rules to be
>autolearned.  Im afraid ham messages of this spammy nature are going to
>influence my bayes database in a negative way.

What's so spammy about the message that bayes will notice? Clearly the
INVALID_MSGID is irrelevant here, and that's the only thing you can point
out that's even remotely "spammy" about the message.

Quite frankly, I use fool's mailing lists as part of my forced-training on 
ham. Every day I pump their messages into my ham training as a part of an 
automated cron job. I do the same for CNN and several other news sites that 
my users use.
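(The exact setup will vary, but the heart of such a cron job is just an
sa-learn invocation along these lines; the mailbox path below is
hypothetical:)

   sa-learn --ham --mbox /var/mail/archives/fool-newsletters.mbox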

(that said, I've not seen any of the fool messages trigger INVALID_MSGID on 
my system)

I think you're being massively over-paranoid. Bayes isn't poisoned so
easily, and quite frankly, the message IS nonspam and IS very typical of
real-world HTML newsletters sent out by legitimate companies all the time.

If SA's bayes engine were so radically upset by real-world email, it
wouldn't work well at all. It's actually important to feed your bayes DB
"spammy ham"; otherwise your bayes database is not going to work very
well when it comes to distinguishing such subtleties.

It's like training a security guard to recognize criminals by only
showing them pictures of suit-wearing accountants and prison inmates.
What are they going to do when they see a construction worker? He looks
more like an inmate than an accountant, so he must be a criminal. Come to
think of it, which do YOU look more like?

Don't isolate your bayes database from reality by giving it false 
impressions of spam and ham. This does much more harm than good. Teach it 
everything you can about ham and spam, and keep it realistic. The bayes DB 
will be able to make much better decisions if you do.



Re: autolearn=ham when it shouldnt

Posted by Jim Maul <jm...@elih.org>.
Quoting Matt Kettler <mk...@evi-inc.com>:

> At 02:03 PM 8/6/2004, Jim Maul wrote:
>
>> Where is it getting this -0.655?  the score shown above is 1.1?
>
> It's recomputing the score of the message as if bayes were disabled, which
> is why it jumped to score set 1, and then back to 3.
>
> This behavior is intentional to prevent bayes rules from self feeding, and
> is in the documentation.
>
> Probably the biggest situation you've got on your hands is why did the
> message hit RCVD_IN_BSP_TRUSTED. Was it *really* sent by a bondedsender
> listed server, or is your trusted_networks misconfigured?

I don't have any trusted networks configured; never even bothered with
it.  The sender was fool.com.

Does anyone know if they are really trusted?  It seems weird to me that they
would be trusted and also send out messages that hit INVALID_MSGID.

I guess anything is possible however.

So now I guess my question is: how can I prevent this type of thing in
the future?  I don't want a message that hits a mess of other positive
rules to be autolearned.  I'm afraid ham messages of this spammy nature
are going to influence my bayes database in a negative way.

Thanks,

Jim

Re: autolearn=ham when it shouldnt

Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:03 PM 8/6/2004, Jim Maul wrote:

>Where is it getting this -0.655?  the score shown above is 1.1?

It's recomputing the score of the message as if bayes were disabled, which 
is why it jumped to score set 1, and then back to 3.

This behavior is intentional, to prevent the bayes rules from
self-feeding, and is in the documentation.
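(In other words, the autolearn thresholds are compared against that
recomputed, non-bayes score, not the score in the headers. If you decide
you don't want autolearning at all, the local.cf switch -- as I
understand the 2.6x option names -- is:)

   bayes_auto_learn 0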

Probably the biggest question you've got on your hands is why the
message hit RCVD_IN_BSP_TRUSTED. Was it *really* sent by a Bonded
Sender-listed server, or is your trusted_networks misconfigured?
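(If trusted_networks does turn out to be the issue, it's declared in
local.cf with something roughly like the following; the netblocks here
are purely hypothetical examples, not values to copy:)

   trusted_networks 192.168/16 10.11.12/24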



RE: autolearn=ham when it shouldnt

Posted by Jim Maul <jm...@elih.org>.
Quoting Bret Miller <br...@wcg.org>:

>
> Maybe AWL?
>
> Bret


I don't think so.  I don't have any whitelisting enabled... unless
auto-whitelist is enabled by default?
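(A couple of rough ways to check, for what it's worth: when the AWL
adjusts a score there is normally an AWL entry in the X-Spam-Status tests
list, and there isn't one above; and the configs and the spamd command
line can be grepped for the auto-whitelist options -- e.g., something
like the commands below, with config paths that may differ on other
setups:)

   grep -ri auto_whitelist /etc/mail/spamassassin/
   ps axww | grep spamd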

Jim

RE: autolearn=ham when it shouldnt

Posted by Bret Miller <br...@wcg.org>.

> -----Original Message-----
> From: Jim Maul [mailto:jmaul@elih.org] 
> Sent: Friday, August 06, 2004 11:03 AM
> To: spamassassin-users@incubator.apache.org
> Subject: autolearn=ham when it shouldnt
> 
> 
> For some reason, i have a message that SA insists on 
> autolearning as ham when
> its score is clearly above the ham autolearn threshold.
> 
> I even put
> 
> bayes_auto_learn_threshold_nonspam 0.1
> 
> in my local.cf just in case.
> 
> X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on 
> external.elih.org
> X-Spam-Level: *
> X-Spam-Status: No, hits=1.1 required=5.0 
> tests=CLICK_BELOW,DEAR_SOMETHING,
>         HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
>         RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63
> 
> 1.1 is higher than 0.1 so why the autolearn?
> Checking the debug, i notice this:
> 
> debug: auto-learn? ham=0.1, spam=10, body-hits=-1.465, head-hits=-4.3
> debug: auto-learn: currently using scoreset 3.  recomputing 
> score based on
> scoreset 1.
> debug: Score set 1 chosen.
> debug: auto-learn: original score: 1.134, recomputed score: -0.655
> debug: Score set 3 chosen.
> debug: auto-learn? yes, ham (-0.655 < 0.1)
> debug: Learning Ham
> 
> Where is it getting this -0.655?  the score shown above is 1.1?
> 
> Any ideas?

Maybe AWL?

Bret
----------
 
Send your spam to: bretmiller@wcg.org
Thanks for keeping the internet spam-free!