Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2012/02/14 22:30:27 UTC

pyzor_max

When was the last time an optimal value for this threshold was checked?

It currently defaults to 5.  It's the number of reports to pyzor required
to trigger a hit on PYZOR_CHECK.  I *think* in this context a report
just means "I got this email", not "This is spam."

Increasing it should reduce false positives (while probably decreasing true
positives).

Thinking about this, and our inability to keep automatically generated
scores in order, it seems like it would be useful to have tests like:

PYZOR_CHECK_005  - At least 5 reports
PYZOR_CHECK_020  - At least 20 reports.
PYZOR_CHECK_040  - etc..
PYZOR_CHECK_160

Which are cumulative.  So if pyzor says 40 reports, you hit all of the
first 3 rules.  Seems like the re-scorer should do useful things with that?
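
To make the cumulative idea concrete, here is a minimal sketch in Python (not
SpamAssassin plugin code). The thresholds come from the rule names proposed
above; the function name and the TIERS constant are made up for illustration.

```python
# Hypothetical tier thresholds, taken from the rule names proposed above.
TIERS = (5, 20, 40, 160)


def pyzor_tier_rules(report_count, tiers=TIERS):
    """Return the name of every tier rule whose threshold the count meets.

    The rules are cumulative: a count of 40 hits the 5, 20 and 40 tiers.
    """
    return [f"PYZOR_CHECK_{t:03d}" for t in tiers if report_count >= t]


if __name__ == "__main__":
    for count in (4, 29, 40, 21812):
        print(count, pyzor_tier_rules(count))
```

With rules shaped like this, each tier can carry its own score, so the
re-scorer is free to weight the higher tiers differently from the lower ones.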


I came across this stuff after finding this in pyzor's mailing list
archives from two months ago:

"I think the SA plug-in will believe Pyzor if there is any number of
reports, but more judicious decisions can be made if the report count
is taken into consideration (however, since Pyzor's just another rule
inside of SA, that perhaps isn't necessary in that context)."
http://sourceforge.net/mailarchive/forum.php?thread_name=5A6FB571-CCAB-4766-939C-E3CCA75FA370%40spamexperts.com&forum_name=pyzor-users

This person's belief was incorrect (the plugin needs at least pyzor_max
reports to hit), but I'm still curious whether taking the report count
into consideration would improve accuracy.


Report counts on my most recent non-spam hits:
public.pyzor.org:24441  (200, 'OK')     460     0
public.pyzor.org:24441  (200, 'OK')     749     0
public.pyzor.org:24441  (200, 'OK')     749     0
public.pyzor.org:24441  (200, 'OK')     460     0
public.pyzor.org:24441  (200, 'OK')     460     0

Report counts on my most recent spam hits with scores over 10:
public.pyzor.org:24441  (200, 'OK')     20817   0
public.pyzor.org:24441  (200, 'OK')     1705    0
public.pyzor.org:24441  (200, 'OK')     363     0
public.pyzor.org:24441  (200, 'OK')     29      0
public.pyzor.org:24441  (200, 'OK')     21812   0

The varying number (460, 749, etc.) is the number of reports.  The 0 at the
end is the whitelisting count.  I don't know if it's ever actually used.
I'd be curious to see the statistics I could gather from putting this stuff
in a header.
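
As a rough sketch of how those two counts could be pulled out for logging,
the Python below parses lines shaped like the ones above. The regex assumes
exactly that layout (server, response tuple, report count, whitelist count),
and the header value format is hypothetical, not something SpamAssassin or
pyzor emits today.

```python
import re
from typing import NamedTuple, Optional


class PyzorResult(NamedTuple):
    server: str
    code: int
    message: str
    reports: int
    whitelisted: int


# Assumed layout: "<server>  (<code>, '<message>')  <reports>  <whitelisted>"
_LINE_RE = re.compile(
    r"^(?P<server>\S+)\s+"
    r"\((?P<code>\d+),\s*'(?P<message>[^']*)'\)\s+"
    r"(?P<reports>\d+)\s+(?P<whitelisted>\d+)\s*$"
)


def parse_pyzor_line(line: str) -> Optional[PyzorResult]:
    """Parse one result line; return None if it does not match the layout."""
    m = _LINE_RE.match(line.strip())
    if m is None:
        return None
    return PyzorResult(
        server=m["server"],
        code=int(m["code"]),
        message=m["message"],
        reports=int(m["reports"]),
        whitelisted=int(m["whitelisted"]),
    )


def header_value(result: PyzorResult) -> str:
    """Format the counts as a header value (hypothetical format)."""
    return f"reports={result.reports}; whitelisted={result.whitelisted}"


if __name__ == "__main__":
    sample = "public.pyzor.org:24441  (200, 'OK')     460     0"
    parsed = parse_pyzor_line(sample)
    if parsed is not None:
        print(header_value(parsed))
```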

-- 
"The price of freedom is the willingness to do sudden battle, anywhere,
at any time, and with utter recklessness." - Robert A. Heinlein
http://www.ChaosReigns.com

Re: pyzor_max

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/14/2012 4:30 PM, darxus@chaosreigns.com wrote:
> When was the last time an optimal value for this threshold was checked?
>
> It currently defaults to 5.  It's the number of reports to pyzor required
> to trigger a hit on PYZOR_CHECK.  I *think* in this context a report
> just means "I got this email", not "This is spam."
>
> Increasing it should reduce false positives (while probably decreasing true
> positives).
>
> Thinking about this, and our inability to keep automatically generated
> scores in order, it seems like it would be useful to have tests like:
>
> PYZOR_CHECK_005  - At least 5 reports
> PYZOR_CHECK_020  - At least 20 reports.
> PYZOR_CHECK_040  - etc..
> PYZOR_CHECK_160
>
> Which are cumulative.  So if pyzor says 40 reports, you hit all of the
> first 3 rules.  Seems like the re-scorer should do useful things with that?
>
>
> I came across this stuff after finding this in pyzor's mailing list
> archives from two months ago:
>
> "I think the SA plug-in will believe Pyzor if there is any number of
> reports, but more judicious decisions can be made if the report count
> is taken into consideration (however, since Pyzor's just another rule
> inside of SA, that perhaps isn't necessary in that context)."
> http://sourceforge.net/mailarchive/forum.php?thread_name=5A6FB571-CCAB-4766-939C-E3CCA75FA370%40spamexperts.com&forum_name=pyzor-users
>
> This person's belief was incorrect, but I'm still curious about improving
> the accuracy from it.
>
>
> Report counts on my most recent non-spam hits:
> public.pyzor.org:24441  (200, 'OK')     460     0
> public.pyzor.org:24441  (200, 'OK')     749     0
> public.pyzor.org:24441  (200, 'OK')     749     0
> public.pyzor.org:24441  (200, 'OK')     460     0
> public.pyzor.org:24441  (200, 'OK')     460     0
>
> Report counts on my most recent spam hits with scores over 10:
> public.pyzor.org:24441  (200, 'OK')     20817   0
> public.pyzor.org:24441  (200, 'OK')     1705    0
> public.pyzor.org:24441  (200, 'OK')     363     0
> public.pyzor.org:24441  (200, 'OK')     29      0
> public.pyzor.org:24441  (200, 'OK')     21812   0
>
> The varying number (460, 749, etc.) is the number of reports.  The 0 at the
> end is the whitelisting count.  I don't know if it's ever actually used.
> I'd be curious to see the statistics I could gather from putting this stuff
> in a header.
>


Looks like a useful line of thought.  I don't use pyzor, but it might be a
good thing to open a ticket and perhaps tag it for GSoC.

regards,
KAM