Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@apache.org> on 2007/07/03 04:37:59 UTC

Inflated hit-frequencies results?

I was perusing my weekly run data and right up at the top:

 91.555  108.0930   0.0000    1.000   1.00    0.00  RCVD_IN_PBL

which obviously is very wrong:

$ grep -c RCVD_IN_PBL ham.log spam.log
ham.log:0
spam.log:78728
$ wc -l ham.log spam.log
    25832 ham.log
   142982 spam.log
   168814 total

so it should just be 55.0615.  I'll try to look into it, but just in case
anyone else goes "aha!" ... :)
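
For anyone checking the arithmetic, here is a minimal sketch of the expected figure. It assumes, as the message does, that every line counted by wc -l is one message and every line matched by grep -c is one hit; the variable names are only illustrative.

```python
# Counts copied from the grep -c and wc -l output above.
spam_hits = 78728     # lines in spam.log that mention RCVD_IN_PBL
spam_lines = 142982   # total lines in spam.log

# Expected SPAM% if each matching line is counted exactly once.
spam_pct = 100.0 * spam_hits / spam_lines
print(f"{spam_pct:.4f}")  # 55.0615
```

Any percentage above 100 therefore cannot come from counting each message once.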

-- 
Randomly Selected Tagline:
"You can see I'm having a good time with this...  Oh, here it is."
                                                    - Prof. Farr

Re: Inflated hit-frequencies results?

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jul 03, 2007 at 02:52:01PM -0500, Michael Parker wrote:
> I'm guessing it's either a) a bug in the reuse code or b) improper use of
> the reuse config stuff.

I couldn't reproduce the problem with a --net --reuse run, and eventually
decided to go the other route -- I added a per-result cache of rule hits in
hit-frequencies.  So now it counts each rule at most once per result, at the
cost of slightly more time per result.
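
The idea can be sketched roughly as follows. This is not the actual hit-frequencies change (that script is Perl and its internals are not shown in this thread); it is a small Python illustration in which the function name and the sample rule lists are made up, and a set plays the role of the per-result cache.

```python
from collections import Counter

def count_rule_hits(rule_lists):
    """Count each rule at most once per result (message)."""
    counts = Counter()
    for rules in rule_lists:
        # The set acts as the per-result cache: a rule listed twice
        # for the same message is only counted once.
        counts.update(set(rules))
    return counts

# Hypothetical rule lists for three results; the first repeats a rule.
results = [
    ["RCVD_IN_PBL", "RCVD_IN_PBL", "OTHER_RULE"],
    ["RCVD_IN_PBL"],
    ["OTHER_RULE"],
]
hits = count_rule_hits(results)
naive = Counter(rule for rules in results for rule in rules)
```

With this input the naive tally gives RCVD_IN_PBL three hits across two messages, while the deduplicated tally gives two. The extra cost is building one small set per result.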

-- 
Randomly Selected Tagline:
	A novice was trying to fix a broken lisp machine by turning the
 power off and on.  Knight, seeing what the student was doing spoke sternly,
 "You cannot fix a machine by just power-cycling it with no understanding
 of what is going wrong."  Knight turned the machine off and on.  The
 machine worked.

Re: Inflated hit-frequencies results?

Posted by Michael Parker <pa...@pobox.com>.
Theo Van Dinter wrote:
> Huh!  The problem is the usual one: GIGO
> 
> Y  8 .../8bcaeebfaa ...,RCVD_IN_PBL,RCVD_IN_PBL,...
> 
> It's assumed that the rule list should have a unique set of names, so
> hit-frequencies just adds the entry twice.
> 
> So now the question is: why does mass-check put the same rule in multiple
> times, and apparently only for weekly runs, and apparently only for this
> rule (pcregrep '([A-Z0-9_]+),\1(,|$)', shows only this rule duplicating)?
> <sigh>

I'm guessing it's either a) a bug in the reuse code or b) improper use of
the reuse config stuff.

Michael



Re: Inflated hit-frequencies results?

Posted by Theo Van Dinter <fe...@apache.org>.
Huh!  The problem is the usual one: GIGO

Y  8 .../8bcaeebfaa ...,RCVD_IN_PBL,RCVD_IN_PBL,...

It's assumed that the rule list should have a unique set of names, so
hit-frequencies just adds the entry twice.

So now the question is: why does mass-check put the same rule in multiple
times, and apparently only for weekly runs, and apparently only for this
rule (pcregrep '([A-Z0-9_]+),\1(,|$)', shows only this rule duplicating)?
<sigh>
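
A rough Python equivalent of that pcregrep check, for anyone who wants to run it over a rule list. It assumes the comma-separated rule field has already been pulled out of the log line; the function name is made up. The pattern here is anchored at a field boundary, which is an addition of this sketch: the unanchored pattern can also match when the tail of one rule name equals the whole of the next one.

```python
import re

# Same idea as the pcregrep pattern, but the captured name must start at
# the beginning of the field list or right after a comma.
ADJACENT_DUP = re.compile(r"(?:^|,)([A-Z0-9_]+),\1(?=,|$)")

# The pattern as written in the message, kept for comparison.
UNANCHORED = re.compile(r"([A-Z0-9_]+),\1(,|$)")

def adjacent_duplicates(rule_field):
    """Return rule names that appear twice in a row in a rule list."""
    return ADJACENT_DUP.findall(rule_field)

# Hypothetical rule fields, not taken from a real log.
dups = adjacent_duplicates("A_RULE,RCVD_IN_PBL,RCVD_IN_PBL,B_RULE")
clean = adjacent_duplicates("FOO_BAR,BAR")
```

Here dups contains only RCVD_IN_PBL and clean is empty, whereas the unanchored pattern would also flag "FOO_BAR,BAR".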


On Tue, Jul 03, 2007 at 12:02:12PM -0400, Theo Van Dinter wrote:
> On Tue, Jul 03, 2007 at 10:24:02AM +0100, Justin Mason wrote:
> > no "aha"s here unfortunately :( -- is this in your own local freqs,
> > or the freqs on the server (with everyone else's logs too)?
> 
> This is from hit-frequencies off of my net-theo weekly logs.
> 
> It's very reproducible too:
> 
> ~/SA/spamassassin-head/masses/hit-frequencies -a -c \
> ~corpus/SA/spamassassin-corpora/rules -x -p | awk \
> '$1 > 100 || $2 > 100 || $3 > 100'
> OVERALL     SPAM%     HAM%     S/O    RANK   SCORE  NAME
>       0    142976    25826   0.847    0.00    0.00  (all messages)
>  91.555  108.0930   0.0000   1.000    1.00    0.00  RCVD_IN_PBL
> 
> and doing a little bit of debugging yesterday, the spam count for that rule
> goes to 154547.  I just haven't figured out why yet though.



-- 
Randomly Selected Tagline:
"I would never have sex with a cow.  Cause that is wrong, and I am
 lactose intolerant."            - Dave Attell

Re: Inflated hit-frequencies results?

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jul 03, 2007 at 10:24:02AM +0100, Justin Mason wrote:
> no "aha"s here unfortunately :( -- is this in your own local freqs,
> or the freqs on the server (with everyone else's logs too)?

This is from hit-frequencies off of my net-theo weekly logs.

It's very reproducible too:

~/SA/spamassassin-head/masses/hit-frequencies -a -c \
~corpus/SA/spamassassin-corpora/rules -x -p | awk \
'$1 > 100 || $2 > 100 || $3 > 100'
OVERALL     SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    142976    25826   0.847    0.00    0.00  (all messages)
 91.555  108.0930   0.0000   1.000    1.00    0.00  RCVD_IN_PBL

and doing a little bit of debugging yesterday, the spam count for that rule
goes to 154547.  I just haven't figured out why yet though.
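
As a sanity check, the inflated count does line up with the reported percentage. A small sketch, assuming the 142976 on the "(all messages)" line is the spam message total used as the denominator, and taking 78728 from the grep -c output in the first message of this thread; the variable names are only illustrative.

```python
inflated_spam_hits = 154547  # per-rule spam count seen while debugging
spam_messages = 142976       # spam total from the "(all messages)" line
grep_spam_hits = 78728       # spam.log lines matching RCVD_IN_PBL (grep -c)

# Percentage implied by the inflated count.
reported_pct = 100.0 * inflated_spam_hits / spam_messages
print(f"{reported_pct:.4f}")  # 108.0930

# Hits counted beyond one per matching log line.
excess_hits = inflated_spam_hits - grep_spam_hits
print(excess_hits)  # 75819
```

So the 108.0930 figure is consistent with the rule being tallied about 75819 more times than there are matching log lines.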

-- 
Randomly Selected Tagline:
"Our users will know fear and cower before our software! Ship it!
 Ship it and let them flee like the dogs they are!"
         - Klingon Programmer's Manual