Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@apache.org> on 2007/07/03 04:37:59 UTC
Inflated hit-frequencies results?
I was perusing my weekly run data and right up at the top:
91.555 108.0930 0.0000 1.000 1.00 0.00 RCVD_IN_PBL
which is obviously wrong, since a SPAM% above 100 is impossible:
$ grep -c RCVD_IN_PBL ham.log spam.log
ham.log:0
spam.log:78728
$ wc -l ham.log spam.log
25832 ham.log
142982 spam.log
168814 total
so it should be just 55.0615% (78728 / 142982). I'll try to look into it, but just in case
anyone else goes "aha!" ... :)
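
The expected figure can be checked from the counts quoted above. A minimal Python sketch, using only the numbers from the grep and wc output; the function name hit_percent is made up for illustration and is not part of hit-frequencies:

```python
def hit_percent(hits: int, total: int) -> float:
    """Percentage of messages in a log that hit a given rule."""
    return 100.0 * hits / total


# Counts taken from the `grep -c` and `wc -l` output above.
spam_hits = 78728
spam_total = 142982

print(f"{hit_percent(spam_hits, spam_total):.4f}")  # prints 55.0615
```

Since a rule can hit a given message at most once, hits can never exceed total, so any value over 100 points at a counting problem rather than at the data.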
--
Randomly Selected Tagline:
"You can see I'm having a good time with this... Oh, here it is."
- Prof. Farr
Re: Inflated hit-frequencies results?
Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jul 03, 2007 at 02:52:01PM -0500, Michael Parker wrote:
> I'm guessing it's either a) a bug in the reuse code or b) improper use of
> the reuse config stuff.
I couldn't reproduce the problem with a --net --reuse run, and eventually
decided to go the other route -- I added a per-result cache of rule hits in
hit-frequencies. So now each rule is counted at most once per result, at the
cost of slightly more time per result.
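
The idea of a per-result cache can be sketched roughly as follows. This is an illustration in Python, not the actual change (hit-frequencies itself is a Perl script); the function name count_rule_hits and the rule name HYPOTHETICAL_RULE are assumptions made for the example:

```python
from collections import Counter


def count_rule_hits(results):
    """Count how many results each rule hit, at most once per result.

    `results` is an iterable of rule-name lists, one list per message.
    """
    freq = Counter()
    for rules in results:
        seen = set()  # per-result cache of rules already counted
        for rule in rules:
            if rule in seen:
                continue  # duplicate within the same result: skip it
            seen.add(rule)
            freq[rule] += 1
    return freq


results = [
    ["RCVD_IN_PBL", "RCVD_IN_PBL", "HYPOTHETICAL_RULE"],  # duplicated entry
    ["RCVD_IN_PBL"],
]
print(count_rule_hits(results)["RCVD_IN_PBL"])  # prints 2
```

The extra cost is one set lookup per rule per result, which fits the observation that the slowdown is small.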
--
Randomly Selected Tagline:
A novice was trying to fix a broken lisp machine by turning the
power off and on. Knight, seeing what the student was doing spoke sternly,
"You cannot fix a machine by just power-cycling it with no understanding
of what is going wrong." Knight turned the machine off and on. The
machine worked.
Re: Inflated hit-frequencies results?
Posted by Michael Parker <pa...@pobox.com>.
Theo Van Dinter wrote:
> Huh! The problem is the usual one: GIGO
>
> Y 8 .../8bcaeebfaa ...,RCVD_IN_PBL,RCVD_IN_PBL,...
>
> It's assumed that the rule list should have a unique set of names, so
> hit-frequencies just adds the entry twice.
>
> So now the question is: why does mass-check put the same rule in multiple
> times, and apparently only for weekly runs, and apparently only for this
> rule (pcregrep '([A-Z0-9_]+),\1(,|$)', shows only this rule duplicating)?
> <sigh>
I'm guessing it's either a) a bug in the reuse code or b) improper use of
the reuse config stuff.
Michael
>
>
> On Tue, Jul 03, 2007 at 12:02:12PM -0400, Theo Van Dinter wrote:
>> On Tue, Jul 03, 2007 at 10:24:02AM +0100, Justin Mason wrote:
>>> no "aha"s here unfortunately :( -- is this in your own local freqs,
>>> or the freqs on the server (with everyone else's logs too)?
>> This is from hit-frequencies off of my net-theo weekly logs.
>>
>> It's very reproducible too:
>>
>> ~/SA/spamassassin-head/masses/hit-frequencies -a -c \
>> ~corpus/SA/spamassassin-corpora/rules -x -p | awk \
>> '$1 > 100 || $2 > 100 || $3 > 100'
>> OVERALL    SPAM%     HAM%   S/O   RANK  SCORE  NAME
>>       0   142976    25826  0.847  0.00   0.00  (all messages)
>>  91.555 108.0930   0.0000  1.000  1.00   0.00  RCVD_IN_PBL
>>
>> and doing a little bit of debugging yesterday, the spam count for that rule
>> goes to 154547. I just haven't figured out why yet though.
>>
>> --
>> Randomly Selected Tagline:
>> "Our users will know fear and cower before our software! Ship it!
>> Ship it and let them flee like the dogs they are!"
>> - Klingon Programmer's Manual
>
>
>
--
Sent from my iPhone
Re: Inflated hit-frequencies results?
Posted by Theo Van Dinter <fe...@apache.org>.
Huh! The problem is the usual one: GIGO
Y 8 .../8bcaeebfaa ...,RCVD_IN_PBL,RCVD_IN_PBL,...
hit-frequencies assumes that each result's rule list contains unique names,
so it counts the duplicated entry twice.
So now the question is: why does mass-check put the same rule in multiple
times, and apparently only for weekly runs, and apparently only for this
rule (pcregrep '([A-Z0-9_]+),\1(,|$)', shows only this rule duplicating)?
<sigh>
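
For anyone without pcregrep, the same check can be sketched in Python. This is only an illustration: the function name adjacent_duplicates is made up, and the pattern adds boundary checks that the pcregrep expression above does not have, so that "A_FOO,FOO" is not reported as a duplicate of FOO:

```python
import re

# Same idea as the pcregrep pattern: a rule name, a comma, then the same name.
# The lookbehind/lookahead keep a name from matching only the tail or head
# of a longer neighbouring name.
ADJACENT_DUP = re.compile(r"(?<![A-Z0-9_])([A-Z0-9_]+),\1(?![A-Z0-9_])")


def adjacent_duplicates(line):
    """Return rule names that appear twice in a row in a log line."""
    return [m.group(1) for m in ADJACENT_DUP.finditer(line)]


# The sample line from the log, elisions left exactly as quoted.
line = "Y 8 .../8bcaeebfaa ...,RCVD_IN_PBL,RCVD_IN_PBL,..."
print(adjacent_duplicates(line))  # prints ['RCVD_IN_PBL']
```

Like the original pattern, this only catches duplicates that sit next to each other in the list; a rule repeated further apart would need the line split on commas and each name counted.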
On Tue, Jul 03, 2007 at 12:02:12PM -0400, Theo Van Dinter wrote:
> On Tue, Jul 03, 2007 at 10:24:02AM +0100, Justin Mason wrote:
> > no "aha"s here unfortunately :( -- is this in your own local freqs,
> > or the freqs on the server (with everyone else's logs too)?
>
> This is from hit-frequencies off of my net-theo weekly logs.
>
> It's very reproducible too:
>
> ~/SA/spamassassin-head/masses/hit-frequencies -a -c \
> ~corpus/SA/spamassassin-corpora/rules -x -p | awk \
> '$1 > 100 || $2 > 100 || $3 > 100'
> OVERALL    SPAM%     HAM%   S/O   RANK  SCORE  NAME
>       0   142976    25826  0.847  0.00   0.00  (all messages)
>  91.555 108.0930   0.0000  1.000  1.00   0.00  RCVD_IN_PBL
>
> and doing a little bit of debugging yesterday, the spam count for that rule
> goes to 154547. I just haven't figured out why yet though.
>
> --
> Randomly Selected Tagline:
> "Our users will know fear and cower before our software! Ship it!
> Ship it and let them flee like the dogs they are!"
> - Klingon Programmer's Manual
--
Randomly Selected Tagline:
"I would never have sex with a cow. Cause that is wrong, and I am
lactose intolerant." - Dave Attell
Re: Inflated hit-frequencies results?
Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jul 03, 2007 at 10:24:02AM +0100, Justin Mason wrote:
> no "aha"s here unfortunately :( -- is this in your own local freqs,
> or the freqs on the server (with everyone else's logs too)?
This is from hit-frequencies off of my net-theo weekly logs.
It's very reproducible too:
~/SA/spamassassin-head/masses/hit-frequencies -a -c \
~corpus/SA/spamassassin-corpora/rules -x -p | awk \
'$1 > 100 || $2 > 100 || $3 > 100'
OVERALL    SPAM%     HAM%   S/O   RANK  SCORE  NAME
      0   142976    25826  0.847  0.00   0.00  (all messages)
 91.555 108.0930   0.0000  1.000  1.00   0.00  RCVD_IN_PBL
and doing a little bit of debugging yesterday, the spam count for that rule
goes to 154547. I just haven't figured out why yet though.
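
The awk filter above also lets the "(all messages)" row through, because that row holds message counts rather than percentages. A rough Python equivalent that skips it is sketched below; the function name impossible_rows is made up, and the seven-column layout is assumed to match the output shown above:

```python
def impossible_rows(lines):
    """Return names of rules whose OVERALL, SPAM% or HAM% value exceeds 100."""
    flagged = []
    for line in lines:
        fields = line.split()
        # Skip short lines, the header, and the "(all messages)" row,
        # which holds message counts rather than percentages.
        if len(fields) < 7 or fields[0] == "OVERALL" or "(all messages)" in line:
            continue
        overall, spam_pct, ham_pct = (float(x) for x in fields[:3])
        if max(overall, spam_pct, ham_pct) > 100:
            flagged.append(fields[6])
    return flagged


# Rows copied from the hit-frequencies output above.
report = [
    "OVERALL SPAM% HAM% S/O RANK SCORE NAME",
    "0 142976 25826 0.847 0.00 0.00 (all messages)",
    "91.555 108.0930 0.0000 1.000 1.00 0.00 RCVD_IN_PBL",
]
print(impossible_rows(report))  # prints ['RCVD_IN_PBL']
```

The reported figure is consistent with the debugging count: 154547 / 142976 is about 108.0930%.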
--
Randomly Selected Tagline:
"Our users will know fear and cower before our software! Ship it!
Ship it and let them flee like the dogs they are!"
- Klingon Programmer's Manual