Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/05/16 02:04:55 UTC

[Bug 6114] New: SpamCop top spammers and top spamming networks

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

           Summary: SpamCop top spammers and top spamming networks
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: apache@khopis.com


70_sc_top200's main merit was that you could give extra points to the bigger
offenders of SpamCop.  It is no longer maintained, but when it was, it would
assign three points to each offender.

SpamCop also publishes the top /8 and /24 CIDR networks by several metrics,
including spam volume.  I've created some rules that examine the top offending
networks at those two levels plus the top offending individual servers (/32
CIDRs) at several thresholds.
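
For illustration only, the levels described above can be pictured as a
containment check of a relay IP against lists of networks.  This is a Python
sketch, not SpamAssassin rule syntax and not the channel's actual
implementation; the list names and the networks (documentation address ranges)
are placeholders, not SpamCop data.

```python
import ipaddress

# Placeholder lists (hypothetical names, documentation ranges), NOT SpamCop data.
OFFENDERS = {
    "CIDR8": ["203.0.0.0/8"],
    "CIDR24": ["198.51.100.0/24"],
    "TOP200": ["192.0.2.25/32"],
}

def matching_lists(relay_ip, offenders=OFFENDERS):
    """Return the name of every list with a network that contains relay_ip."""
    ip = ipaddress.ip_address(relay_ip)
    hits = []
    for name, cidrs in offenders.items():
        if any(ip in ipaddress.ip_network(cidr) for cidr in cidrs):
            hits.append(name)
    return hits
```

A relay can land in several lists at once, which is why each level would get
its own (small) score rather than one combined verdict.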

This currently exists as an sa-update channel (khop-sc-neighbors at
http://khopesh.com/Anti-spam#sa-update_channels ) that is repopulated from its
source SpamCop/SenderBase data every four hours.  I also have an experimental
DNSBL, which is somewhat nonsensical given the extremely small size of the data
(the generated BIND configuration is 49KB while the generated SA config is only
8.4KB!), but it might facilitate a better test given how ruleqa can't deal
with channels yet...


KHOP_SC_CIDR8 does contain a number of false positives.  For the most part, the
scores are all tried-and-true, though the channel only recently got the /32
offenders and only recently started allowing RCVD_IN_BL_SPAMCOP_NET overlap
with the CIDR tests.


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #2 from Karsten Bräckelmann <gu...@rudersport.de>  2009-05-15 17:23:23 PST ---
(In reply to comment #0)
> 70_sc_top200's main merit was [...]

Now that rings some bells. :)  Old rule-set, outdated, deprecated, not updated,
blah blah blah.  All buried somewhere in last month's list archives, or
something.

> SpamCop also publishes the top /8 and /24 CIDR networks by several metrics,
> including spam volume.  I've created some rules that examine the top offending
> networks at those two levels plus the top offending individual servers (/32
> CIDRs) at several thresholds.

They do? Nice.  I do recall having some brief look, though didn't find that.
Any pointers, where that is?



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114


Adam Katz <ap...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #4446 is|0                           |1
           obsolete|                            |




--- Comment #5 from Adam Katz <ap...@khopis.com>  2009-05-19 10:47:54 PST ---
Created an attachment (id=4448)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4448)
khop-sc-neighbors channel SA config from 2009-05-19 1:45p EDT

Ooops, problem!  My *_TOP_CIDR* rules were actually the bottom.  That's what I
get for refactoring the generating code and not paying enough attention.  The
channel is now fixed, and my recent discovery of the /16 CIDR has been added as
well.

/16 CIDR:
http://www.spamcop.net/w3m?action=map;net=bmaxcnt;mask=16777215;sort=spamcnt

Other SpamCop stats:  http://www.spamcop.net/spamstats.shtml



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #15 from Justin Mason <jm...@jmason.org>  2009-08-27 07:44:39 PST ---
(In reply to comment #14)
> > Keep in mind that this is using data that is 57 days old (May 19, new version
> > attached) for a data set that is very time-specific.  You can see this impact
> > in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, 
> > http://tinyurl.com/ksc3wa  (that's a shot of what it looks like now) - there
> > were almost zero hams on May 19, but the hams spiked up a week later and again
> > for this week.  Who's to say that the problematic entries were present at those
> > times?  We know only that the ham count was best on the day it was released.
> 
> Do we have a means to automatically update these rules on a regular basis?

not unless Adam fancies getting himself an SVN commit bit ;)


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114


Adam Katz <ap...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4448|0                           |1
        is obsolete|                            |




--- Comment #8 from Adam Katz <ap...@khopis.com>  2009-07-15 17:29:42 PST ---
Created an attachment (id=4485)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4485)
khop-sc-neighbors channel SA config from 2009-07-15 8p EDT

> KHOP_SC_TOP_CIDR16 looks good!

Actually, I think most of them look pretty good:

MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME
0.00000  12.3655   0.8152   0.938    0.70    0.01  T_KHOP_SC_CIDR8
0.00000  22.1685   0.5394   0.976    0.75    1.00  KHOP_SC_TOP_CIDR8
0.00000   0.3683   0.0000   1.000    0.79    1.00  KHOP_SC_CIDR16
0.00000   0.8412   0.0000   1.000    0.85    1.00  KHOP_SC_TOP_CIDR16
0.00000   0.0129   0.0000   1.000    0.56    0.01  T_KHOP_SC_CIDR24
0.00000   0.0000   0.0000   0.500    0.49    0.01  T_KHOP_SC_TOP_CIDR24
0.00000   0.0909   0.0000   1.000    0.70    1.00  KHOP_SC_TOP200
0.00000   0.3400   0.0000   1.000    0.79    1.00  KHOP_SC_TOP100
0.00000   0.0024   0.0000   1.000    0.50    0.01  T_KHOP_SC_TOP20
0.00000   0.0008   0.0000   1.000    0.49    0.01  T_KHOP_SC_TOP10
0.00000   0.4341   0.0000   1.000   >0.79    0.00  (union of last 4)

Keep in mind that this is using data that is 57 days old (May 19, new version
attached) for a data set that is very time-specific.  You can see this impact
in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, 
http://tinyurl.com/ksc3wa  (that's a shot of what it looks like now) - there
were almost zero hams on May 19, but the hams spiked up a week later and again
for this week.  Who's to say that the problematic entries were present at those
times?  We know only that the ham count was best on the day it was released.

This data suggests that I should either fold TOP10 and TOP20 back into TOP100
and possibly TOP200 (as summed above) or get rid of those single-IP hits
altogether.  I do worry about the length of the regular expression ... though
it's not as long as some of the sought rules.  I've considered fixing it with a
search tree optimization that short-circuits groups by octet, so something like
/\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/ to match what would otherwise
be /\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/, but either
sa-compile is smart enough to do that for me or it isn't worth my time.
This stuff was mostly just to appease the people who wanted to highly penalize
the top 200 offender list (like the original SARE channel).
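
The octet-grouping idea can be sketched as a small trie builder.  This is a
Python illustration under stated assumptions (complete dotted-quad IPv4
addresses as input; the function names are hypothetical), not the code that
generates the channel and not what sa-compile does internally.

```python
import re

def build_trie(ips):
    """Nest addresses by octet, e.g. {'1': {'2': {'3': {'4': {}}}}}."""
    trie = {}
    for ip in ips:
        node = trie
        for octet in ip.split("."):
            node = node.setdefault(octet, {})
    return trie

def trie_to_regex(node):
    """Emit one alternation per trie level, recursing into child octets."""
    parts = []
    for octet in sorted(node, key=int):
        child = node[octet]
        parts.append(octet + (r"\." + trie_to_regex(child) if child else ""))
    # A lone branch needs no group; several branches share one (?:...|...).
    return parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"

def ip_list_regex(ips):
    """Wrap the grouped pattern in word boundaries, as in the rules above."""
    return r"\b" + trie_to_regex(build_trie(ips)) + r"\b"

if __name__ == "__main__":
    sample = ["1.2.3.4", "1.2.3.5", "1.2.3.6", "1.2.7.8", "1.2.7.9"]
    print(ip_list_regex(sample))
```

The output differs cosmetically from the hand-written pattern (single-branch
levels are not wrapped in a group) but matches the same five addresses.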

Running some math using just SpamCop's numbers, the top200 list's summed
percentage of contributions to their spam total is only 1.356% (or 1.556% if we
assume rounding by truncation with full-blown optimism on the hidden values). 
Adjusting for the fact that RCVD_IN_BL_SPAMCOP_NET only hits 56.7% of the SA
test corpus, we're down to 0.769% (or round that up to 0.883%).  I guess that's
not bad, but it is twice the 0.434% reported above.
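
The scaling in that paragraph can be replayed as follows.  This is only a
Python restatement of the arithmetic using the figures quoted above; the
function name is hypothetical.

```python
def expected_hit_pct(top200_share_pct, spamcop_coverage_pct):
    """Scale the top-200 share of SpamCop's spam by SpamCop's corpus coverage."""
    return top200_share_pct * spamcop_coverage_pct / 100.0

low = expected_hit_pct(1.356, 56.7)    # summed published percentages
high = expected_hit_pct(1.556, 56.7)   # optimistic allowance for truncation
observed = 0.4341                      # SPAM% of the union row in the table

print(round(low, 3), round(high, 3), round(low / observed, 2))
```

On these numbers the lower estimate is a little under twice the observed hit
rate.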

I've also noticed that a large number of SA admins don't have DNSEval
functioning properly.  My khop-sc-neighbors channel now compensates for this by
adding the points that would have been expected from those DNSBLs, which you
can see at the very end of the attached latest version.

Now that I know a little more about the ruleqa system (the T_* bit), I'll try
to post more immediate stats on the data from this attachment once it lands; it
should yield results a few days after landing in SVN, right?  Last time I missed
the window:  by the time I found the stats, the data had already grown stale,
as noted in the next week's ham spike detailed above.


Additionally, recall that I assigned a very small number of points to the CIDR8
rules as I was fully expecting some FPs.  I've even scored them a little lower
just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for CIDR8.  Perhaps I'm
not reading the score-map right, but 95.77% of the ham hits scored under 3.999
(84.14% scored under 0.999), so a small bump won't make a difference.  Given
the current data, T_KHOP_SC_CIDR8 would only add points to ONE false positive
hit (0.21% of the ham) and even if scored at 2.0, it would create 23 FPs (4.87%
of the 0.8152% of the hams, which is to say 0.0397% of the ham).  Scoring it
1.0 or less wouldn't actually have added any FPs.
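
The percentages in that paragraph chain together as shown in this Python
restatement.  The implied number of ham hits is derived from the quoted
figures, not read from ruleqa, so treat it as an estimate.

```python
def overall_fp_pct(ham_hit_pct, fp_share_of_hits_pct):
    """Share of all ham that would become FPs, given the share among ham hits."""
    return ham_hit_pct * fp_share_of_hits_pct / 100.0

fp_pct = overall_fp_pct(0.8152, 4.87)       # about 0.0397% of all ham

# Derived estimate: 23 FPs being 4.87% of the ham hits implies this many hits.
implied_ham_hits = 23 / (4.87 / 100.0)      # about 472
one_hit_pct = 100.0 / implied_ham_hits      # about 0.21%, as quoted for ONE hit

print(round(fp_pct, 4), round(implied_ham_hits), round(one_hit_pct, 2))
```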


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #9 from Justin Mason <jm...@jmason.org>  2009-07-16 02:30:00 PST ---
hi Adam -- added that now.  yep, it'll take a day or two to show up.

have you seen
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm
, btw?


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #4 from Justin Mason <jm...@jmason.org>  2009-05-16 01:26:44 PST ---
ok, checked in -- let's see how they go ;)



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #12 from Justin Mason <jm...@jmason.org>  2009-07-17 02:09:21 PST ---
(In reply to comment #11)
> Regexp::Assemble looks like the more interesting of the two, even if it's
> easier for me to split() the regexp into pieces and then add() them to the RA
> object.  Sure enough, it was a quick edit (only 8 lines of code, and I think
> the resulting code is cleaner anyway).  My main worry is that the optimization
> is more for regexp size than for performance.

well, it's testable; take a small selection of "test" Received IPs from your
corpus, put them in a perl script, then use Benchmark:

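    # Sample input; 1.2.8.9 appears in neither pattern's list.
    $_ = "1.2.8.9";

    use Benchmark qw(:all);
    # A negative count runs each sub for at least that many CPU seconds.
    timethese(-2, {
          'R:A' => sub {    # nested form, grouped by octet
            /\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/
          },
          'plain' => sub {  # flat alternation
            /\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/
          },
        });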


that'll produce a nice little chart telling you which one is faster. (in
that really basic example, it's "plain" if the test IP appears in the list,
or "R:A" if it doesn't, giving a demo of why you want to use better test
data if possible.)

> I've also merged TOP10+TOP20+TOP100+TOP200 into TOP200, which makes its
> definition 2751 characters with a slew of nesting after reduction via
> Regexp::Assemble, which is a thousand more than when it was just a list of
> SpamCop's 101-200 top offenders.
> 
> I'm going to sit on it for a few days before pushing it here just in case it
> doesn't work well (though it's live on my sa-update channel).

Sounds good. +1

> Any comments on my conclusions when I said this?
> > Additionally, recall that I assigned a very small number of points to the
> > CIDR8 rules as I was fully expecting some FPs.  I've even scored them a
> > little lower just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for
> > CIDR8.  Perhaps I'm not reading the score-map right, but 95.77% of the ham
> > hits scored under 3.999 (84.14% scored under 0.999), so a small bump won't
> > make a difference.  Given the current data, T_KHOP_SC_CIDR8 would only add
> > points to ONE false positive hit (0.21% of the ham) and even if scored at
> > 2.0, it would create 23 FPs (4.87% of the 0.8152% of the hams, which is to
> > say 0.0397% of the ham).  Scoring it 1.0 or less wouldn't actually have
> > added any FPs.

I'm not sure.

http://ruleqa.spamassassin.org/20090714-r793817-n/T_KHOP_SC_CIDR8/detail

  scoremap  ham:  0  80.68%  668 ********************************
  scoremap  ham:  1   8.09%   67 ***
  scoremap  ham:  2   5.56%   46 **
  scoremap  ham:  3   1.69%   14 
  scoremap  ham:  4   3.02%   25 *
  scoremap  ham:  6   0.85%    7 
  scoremap  ham:  8   0.12%    1 

http://ruleqa.spamassassin.org/20090714-r793817-n/KHOP_SC_TOP_CIDR8/detail

  scoremap  ham:  0  66.49%  252 **************************
  scoremap  ham:  1  21.90%   83 ********
  scoremap  ham:  2   2.37%    9 
  scoremap  ham:  3   1.58%    6 
  scoremap  ham:  4   6.33%   24 **
  scoremap  ham:  5   0.53%    2 
  scoremap  ham:  6   0.79%    3 


The danger is those hits around 4. They may be _just_ under 5 points, in which
case those will be tipped over the FP threshold very easily.
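
To put a rough number on that risk, the two scoremaps can be tallied as in the
Python sketch below.  It assumes each bucket b holds message scores in
[b, b + 1) and a spam threshold of 5.0; whether the scoremap scores already
include the rule's own points is not stated here, so the result is an upper
bound rather than a prediction.

```python
# Bucket -> ham count, copied from the two scoremaps above.
T_KHOP_SC_CIDR8 = {0: 668, 1: 67, 2: 46, 3: 14, 4: 25, 6: 7, 8: 1}
KHOP_SC_TOP_CIDR8 = {0: 252, 1: 83, 2: 9, 3: 6, 4: 24, 5: 2, 6: 3}

def at_risk(scoremap, added_score, threshold=5.0):
    """Count ham hits below the threshold that added_score could push over it.

    Assumption: bucket b covers [b, b + 1), so a hit in bucket b can reach the
    threshold only when b + 1 + added_score exceeds it.
    """
    total = sum(scoremap.values())
    risky = sum(count for bucket, count in scoremap.items()
                if bucket < threshold and bucket + 1 + added_score > threshold)
    return risky, total, 100.0 * risky / total

for name, scoremap in (("T_KHOP_SC_CIDR8", T_KHOP_SC_CIDR8),
                       ("KHOP_SC_TOP_CIDR8", KHOP_SC_TOP_CIDR8)):
    print(name, at_risk(scoremap, 1.0))
```

With one added point only the bucket at 4 is exposed, which is the group of
hits the paragraph above is worried about.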

In addition, that kind of "damned by association" rule will be very contentious
with people who find their mail is being marked as spam; there's not much they
can do about being in the same /8 as a bad spammer.

I'd prefer to "lock them down" at low values.

We could wait and see what Daryl's rescoring code makes of it... although that
doesn't seem to be running at the moment.

--j.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

Warren Togami <wt...@redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wtogami@redhat.com

--- Comment #17 from Warren Togami <wt...@redhat.com> 2009-09-23 21:56:41 PDT ---
(In reply to comment #16)
> Ultimately, isn't this just a really tiny DNSBL?  That might be more
> appropriate than a rule that must be updated very frequently.

OK, a tiny DNSBL is a bad idea.  No sense doing yet another network query for
every message for such a tiny list.  Far more efficient to sync the list on a
daily basis.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #16 from Warren Togami <wt...@redhat.com>  2009-08-27 07:57:33 PST ---
Ultimately, isn't this just a really tiny DNSBL?  That might be more
appropriate than a rule that must be updated very frequently.

Also, how does the source of this data feel about us copying it?


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

AXB <ax...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #22 from AXB <ax...@gmail.com> ---
I see 20_khop_sc_bug_6114.cf updating on a daily basis - all set to nopublish.

What's the point?

Can we lose this sort of stuff which only uses cycles/time during masschecks?
Or should we locally exclude them from masschecks?

From what I see most of the 20_khop_* rules have nopublish in them.
Seems like an awful waste of resources.
Testing files like 20_chickenpox.cf seems pointless.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #14 from Warren Togami <wt...@redhat.com>  2009-08-27 06:56:11 PST ---
> Keep in mind that this is using data that is 57 days old (May 19, new version
> attached) for a data set that is very time-specific.  You can see this impact
> in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, 
> http://tinyurl.com/ksc3wa  (that's a shot of what it looks like now) - there
> were almost zero hams on May 19, but the hams spiked up a week later and again
> for this week.  Who's to say that the problematic entries were present at those
> times?  We know only that the ham count was best on the day it was released.

Do we have a means to automatically update these rules on a regular basis?


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #19 from Warren Togami <wt...@redhat.com> 2009-09-24 07:57:38 PDT ---
(In reply to comment #18)
> As to hosting sc-neighbors as a DNSBL ... it's something I've considered (I
> even have a somewhat complete implementation of it), but Warren is right on the
> money when he concludes that the list is so damn small that it doesn't make any
> sense as a DNSBL.  My only additional thought is that I could rig it as a DNSBL
> solely for the purpose of not needing to update the rules in SVN, but I'd be
> much happier syncing the static rules.

Perhaps we should consider a different way to handle SOUGHT and top200 on a
daily basis.  It is a bit noisy to have daily svn commits, especially if we add
more daily rule channels.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

Adam Katz <ap...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4485|0                           |1
        is obsolete|                            |

--- Comment #18 from Adam Katz <ap...@khopis.com> 2009-09-24 07:50:25 PDT ---
Created an attachment (id=4540)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4540)
khop-sc-neighbors channel SA config from 2009-09-24 10a EDT

Current version attached, including move to optimized regexps from a while ago.

Justin:  can you point me to some docs regarding the SVN rules, directions, and
responsibilities (i.e. how to get the account, what I should and shouldn't do
with the account, what I can and cannot submit)?

Warren:  as discussed in IRC, I tried asking spamcop for explicit permission
but got no response.  This means they know about it and know that it's live and
in use, and presumably they'd have objected if it were frowned upon.

Further justification:  the SpamCop Blocking List (SCBL, bl.spamcop.net) is
fully free for all to use as per
<http://www.spamcop.net/fom-serve/cache/299.html> and they publish their top
offending network lists in a parser-friendly format, which wouldn't make any
sense if the data was supposed to be protected or private.  Also, the data
pulled by my script is negligible at a few KB a day (I use more of their
bandwidth with my spam reports!), pulled only by my server, which then serves
the data to the world, so there is no bandwidth issue.

As to hosting sc-neighbors as a DNSBL ... it's something I've considered (I
even have a somewhat complete implementation of it), but Warren is right on the
money when he concludes that the list is so damn small that it doesn't make any
sense as a DNSBL.  My only additional thought is that I could rig it as a DNSBL
solely for the purpose of not needing to update the rules in SVN, but I'd be
much happier syncing the static rules.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #24 from AXB <ax...@gmail.com> ---
(In reply to comment #23)
> It makes no sense for this to be in nightly masschecks.
> 
> It belongs in a separate DNSBL or SOUGHT-like add-on sa-update channel.

It's been hanging around for years, doing nothing.
Can you pls remove the bloat?

I'd like to increase the spam corpus size.
If I do, the processing time of the weekly checks gets so long that it does not
fit in the timeslot.
Removing "no hit" rules, "nopublish" bloat, etc. would probably shorten
masscheck time.


> (I've been away for a while.  Is SOUGHT still updating?)
SOUGHT_FRAUD is ok - generic SOUGHT seems faulty, atm
JM has been contacted.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114


Adam Katz <ap...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4446|application/octet-stream    |text/plain
          mime type|                            |





Re: [Bug 6114] SpamCop top spammers and top spamming networks

Posted by Axb <ax...@gmail.com>.
On 03/06/2013 01:17 AM, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114
>
> --- Comment #27 from Adam Katz <an...@khopis.com> ---
> (In reply to comment #26)
>> (In reply to comment #24)
>>>
>>> I'd like to increase the spam corpus size.
>>> If I do, the processing time of the weekly checks gets so long that it does
>>> not fit in the timeslot.
>>> Removing "no hit" rules, "nopublish" bloat, etc. would probably shorten
>>> masscheck time.
>>
>> Would there be any harm in excluding all non-network nopublish rules from
>> weekly?
>
> khop-sc-neighbors rules aren't actually network rules, and are intentionally
> scored far higher when network checks are disabled -- or when DNSEval isn't
> loaded.
>
> However, I think you were asking in general, in which case that discussion
> should be moved to the dev email list.

Moved off Bugzilla

Could we lose the autogenerated KHOP_ISC_*  (20_isc_attackers.cf) rules?
They have zero hits.

My request is general.

As I see it, running nopublish rules for a while may make sense while
developing, checking performance, etc., but keeping them in that state for
years shows a lack of respect for other people's volunteered masscheck
resources, no matter how optimized and lightweight the rules/tests may be.
And/or the sandbox owner has lost interest in the project and we should
treat these rules as unmaintained and purge them.

atm, sa-update is publishing updates regularly tho we could use more 
ham. (according to http://www.chaosreigns.com/dnswl/tot.svg, we're 
dangerously low)

comments? ideas? rants?

Axb









[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #27 from Adam Katz <an...@khopis.com> ---
(In reply to comment #26)
> (In reply to comment #24)
> > 
> > I'd like to increase the spam corpus size.
> > If I do, the processing time of the weekly checks gets so long that it does
> > not fit in the timeslot.
> > Removing "no hit" rules, "nopublish" bloat, etc. would probably shorten
> > masscheck time.
> 
> Would there be any harm in excluding all non-network nopublish rules from
> weekly?

khop-sc-neighbors rules aren't actually network rules, and are intentionally
scored far higher when network checks are disabled -- or when DNSEval isn't
loaded.

However, I think you were asking in general, in which case that discussion
should be moved to the dev email list.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #3 from Adam Katz <ap...@khopis.com>  2009-05-15 22:17:26 PST ---
(In reply to comment #2)
> (In reply to comment #0)
> > 70_sc_top200's main merit was [...]
> 
> Now that rings some bells. :)  Old rule-set, outdated, deprecated, not updated,
> blah blah blah.  All buried somewhere in last month's list archives, or
> something.

The OpenProtect sa-update channel syndicates a number of dangerously outdated
SARE channels, specifically 70_sc_top200.  Lots of people still use it (and the
stale SARE channels) without realizing this issue.  Big problem.

> > SpamCop also publishes the top /8 and /24 CIDR networks by several metrics,
> > including spam volume.  I've created some rules that examine the top offending
> > networks at those two levels plus the top offending individual servers (/32
> > CIDRs) at several thresholds.
> 
> They do? Nice.  I do recall having some brief look, though didn't find that.
> Any pointers, where that is?

/8:  http://spamcop.net/w3m?action=map;net=0;sort=spamcnt
/24: http://spamcop.net/w3m?action=map;net=cmaxcnt;mask=65535;sort=spamcnt
/32: http://www.spamcop.net/w3m?action=hoshame

The first two have tsv links at the bottom.
I scrape the third with "links -dump $url"



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #23 from Warren Togami <wt...@gmail.com> ---
It makes no sense for this to be in nightly masschecks.

It belongs in a separate DNSBL or SOUGHT-like add-on sa-update channel.

(I've been away for a while.  Is SOUGHT still updating?)


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

Adam Katz <an...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #21 from Adam Katz <an...@khopis.com> 2010-03-30 21:30:11 UTC ---
This bug was mostly about getting the rules into a sandbox for testing.  Since
I now have svn access and my server now syncs to svn nightly, this is a
resolved issue.

Discussion continues in bug 6390 with respect to scoring and what to do next.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #6 from Justin Mason <jm...@jmason.org>  2009-05-20 01:58:49 PST ---
(In reply to comment #5)
> Created an attachment (id=4448)
>  --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4448)
> khop-sc-neighbors channel SA config from 2009-05-19 1:45p EDT
> 
> Ooops, problem!  My *_TOP_CIDR* rules were actually the bottom.  That's what I
> get for refactoring the generating code and not paying enough attention.  The
> channel is now fixed, and my recent discovery of the /16 CIDR has been added as
> well.

thanks for the update.  checked in:

: 42...; svn commit -m "bug 6114: update Adam's ruleset"
../../rulesrc/sandbox/jm/20_khop_sc_bug_6114.cf 
Sending        rulesrc/sandbox/jm/20_khop_sc_bug_6114.cf
Transmitting file data .
Committed revision 776624 ( https://svn.apache.org/viewcvs.cgi?view=rev&rev=776624 ).


keep an eye on ruleqa.spamassassin.org over the next few days for results....



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #26 from Warren Togami <wt...@gmail.com> ---
(In reply to comment #24)
> 
> I'd like to increase the spam corpus size.
> If I do, the processing time of the weekly checks gets so long that it does
> not fit in the timeslot.
> Removing "no hit" rules, "nopublish" bloat, etc. would probably shorten
> masscheck time.

Would there be any harm in excluding all non-network nopublish rules from
weekly?


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #7 from Justin Mason <jm...@jmason.org>  2009-07-15 14:16:11 PST ---
MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE

0.00000   0.8412   0.0000   1.000    0.85    1.00  KHOP_SC_TOP_CIDR16  
0.00000  22.1685   0.5394   0.976    0.75    1.00  KHOP_SC_TOP_CIDR8  
0.00000   0.0000   0.0000   0.500    0.49    0.01  T_KHOP_SC_TOP_CIDR24  

KHOP_SC_TOP_CIDR16 looks good!


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #11 from Adam Katz <ap...@khopis.com>  2009-07-16 16:13:44 PST ---
Regexp::Assemble looks like the more interesting of the two, even if it's
easier for me to split() the regexp into pieces and then add() them to the RA
object.  Sure enough, it was a quick edit (only 8 lines of code, and I think
the resulting code is cleaner anyway).  My main worry is that the optimization
is more for regexp size than for performance.

I've also merged TOP10+TOP20+TOP100+TOP200 into TOP200, which makes its
definition 2751 characters with a slew of nesting after reduction via
Regexp::Assemble, which is a thousand more than when it was just a list of
SpamCop's 101-200 top offenders.
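
To illustrate the kind of reduction involved, here is a minimal Python sketch
of prefix merging, the basic idea behind tools like Regexp::Assemble.  It is
not that module's algorithm or API and not the channel's generator code; the
function names are hypothetical and the addresses come from the reserved
documentation ranges.

```python
import re


def assemble(literals):
    """Merge literal strings into one regex body by sharing common prefixes."""
    if not literals:
        raise ValueError("need at least one literal")
    trie = {}
    for literal in literals:
        node = trie
        for ch in literal:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a literal ends here"
    return _emit(trie)


def _emit(node):
    """Turn one trie node into a regex fragment."""
    branches = [re.escape(ch) + _emit(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""  # only the end marker: nothing more to match
    ends_here = "" in node
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    return group + "?" if ends_here else group


# Illustrative addresses only (documentation ranges), not SpamCop data.
addresses = ["192.0.2.1", "192.0.2.17", "192.0.2.200", "198.51.100.7"]
merged = assemble(addresses)
plain = "|".join(re.escape(a) for a in addresses)
print(len(plain), len(merged))
print(all(re.fullmatch(merged, a) for a in addresses))
```

The merged pattern is shorter than the plain alternation whenever entries
share prefixes, at the cost of nested groups.  Whether it also matches faster
depends on the regex engine, so it would need to be measured.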

I'm going to sit on it for a few days before pushing it here just in case it
doesn't work well (though it's live on my sa-update channel).

Any comments on my conclusions when I said this?
> Additionally, recall that I assigned a very small number of points to the
> CIDR8 rules as I was fully expecting some FPs.  I've even scored them a
> little lower just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for
> CIDR8.  Perhaps I'm not reading the score-map right, but 95.77% of the ham
> hits scored under 3.999 (84.14% scored under 0.999), so a small bump won't
> make a difference.  Given the current data, T_KHOP_SC_CIDR8 would only add
> points to ONE false positive hit (0.21% of the ham) and even if scored at
> 2.0, it would create 23 FPs (4.87% of the 0.8152% of the hams, which is to
> say 0.0397% of the ham).  Scoring it 1.0 or less wouldn't actually have
> added any FPs.
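
The percentage chain in that quote can be checked directly.  A small Python
sketch of the arithmetic, using only the figures quoted above (the function
name is hypothetical):

```python
def share_of_all_ham(pct_of_rule_ham_hits, rule_ham_pct):
    """Convert 'X% of the rule's ham hits' into a percentage of all ham.

    Both arguments and the result are percentages.
    """
    return pct_of_rule_ham_hits / 100.0 * rule_ham_pct


# 4.87% of the rule's ham hits, where the rule hits 0.8152% of all ham.
fp_share = share_of_all_ham(4.87, 0.8152)
print(f"{fp_share:.4f}% of all ham")
```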


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #10 from Henrik Krohns <he...@hege.li>  2009-07-16 02:39:20 PST ---
Not that it probably matters much, but you should look at Regexp::Assemble,
since it seems to be very professionally made, tested and still active.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #13 from Justin Mason <jm...@jmason.org>  2009-08-27 02:20:33 PST ---
Warren Togami to SpamAssassin (8 hours ago)

On 08/26/2009 04:31 AM, Rules Report Cron wrote:

    rulesrc/sandbox/jm/20_khop_sc_bug_6114.cf (10 rules, 10 bad):

      KHOP_SC_CIDR16:  no hits at all
      KHOP_SC_CIDR24:  no hits at all
      KHOP_SC_CIDR8:  no hits at all
      KHOP_SC_TOP10:  no hits at all
      KHOP_SC_TOP100:  no hits at all
      KHOP_SC_TOP20:  no hits at all
      KHOP_SC_TOP200:  no hits at all
      KHOP_SC_TOP_CIDR16:  no hits at all
      KHOP_SC_TOP_CIDR24:  no hits at all
      KHOP_SC_TOP_CIDR8:  no hits at all


khopesh in #spamassassin mentioned that these rules in the sandbox broke a few
weeks ago when the sandbox moved.  He hasn't had time to follow up.  I don't
know the details myself.

Warren


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

Adam Katz <ap...@khopis.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |apache@khopis.com

--- Comment #20 from Adam Katz <ap...@khopis.com> 2009-09-24 09:52:02 PDT ---
(In reply to comment #19)
> Perhaps we should consider a different way to handle SOUGHT and top200 on a
> daily basis.  It is a bit noisy to have daily svn commits, especially if we add
> more daily rule channels.

Many of the elements of sc-neighbors change a bit each time (rather than just
top200, though that is likely the most dynamic of them), and the channel is
updated every few hours rather than every day.

One possible solution would be to have a separate SVN repository for the
automated lists.  That repository could be completely separate, or it could
sync with the main repo whenever somebody commits something by hand.  I'm also
willing to part with the generator's code if there's a better host (e.g. an SA
server tasked to this sort of thing).  I have an older version published on my
website, but I never cleaned up after adding the DNSBL crap to it, so it's a
mess at the moment.


(In reply to some talk on IRC, 2009-09-24 01:04 EDT)
> 01:04 < warren> khopesh: [24439] dbg: dns: query failed: 
>                 0.3.3.khop-sc-neighbors.sa.khopesh.com => NXDOMAIN
> 01:04 < warren> khopesh: your channels need to be updated for 3.3.0?

I consider it a common problem when channels just support ALL releases in one
fell swoop.  It makes issues VERY hard to correct (especially given the lack
of expiration options -- now requested as bug 6210 -- which sc-neighbors SORELY
needs).  While this channel is almost certainly okay (though note the bottom
rules, which try to approximate DNSBL results if the implementation lacks
DNSEval ... bad things happen if SA changes its code to rename or replace
DNSEval), my other channels will remain available for 3.2.x only until an
official 3.3.0 Changelog hits.
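
For context, the failed lookup quoted above shows the shape of the name being
queried: the SA version with its components reversed, prefixed to the channel
name.  A small Python sketch of that construction, inferred only from the log
line (the function name is hypothetical; this is not sa-update's code):

```python
def channel_query_name(sa_version, channel):
    """Build the DNS name seen in the log: reversed version + channel."""
    reversed_version = ".".join(reversed(sa_version.split(".")))
    return f"{reversed_version}.{channel}"


channel = "khop-sc-neighbors.sa.khopesh.com"
for version in ("3.2.5", "3.3.0"):
    print(channel_query_name(version, channel))
```

A channel that only publishes records for some versions will therefore return
NXDOMAIN for the others, which is what the 3.3.0 query above ran into.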

> 12:36 < warren> khopesh: lots of people are using 3.3.0 already to validate
>                 it prior to release, and the official and SOUGHT channels
>                 already have 3.3.0 channels
> 12:37 < khopesh> is it in release candidacy?
> 12:37 < warren> pretty close, only need to rescore at this point

The re-score is part of the problem, since the bottom section of this channel
includes scores aimed to approximate the scores from RCVD_IN_BL_SPAMCOP_NET if
it is missing.


[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114





--- Comment #1 from Adam Katz <ap...@khopis.com>  2009-05-15 17:07:05 PST ---
Created an attachment (id=4446)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4446)
khop-sc-neighbors channel SA config from 2009-05-15 8p EDT



[Bug 6114] SpamCop top spammers and top spamming networks

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

--- Comment #25 from Adam Katz <an...@khopis.com> ---
(In reply to comment #24)
> (In reply to comment #23)
> > It makes no sense for this to be in nightly masschecks.
> > 
> > It belongs in a separate DNSBL or SOUGHT-like add-on sa-update channel.
> 
> It's been hanging around for years, doing nothing.
> Can you pls remove the bloat?
> 
> > I'd like to increase the spam corpus size.
> > If I do, the processing time of the weekly checks is so long that it does not
> > fit in the timeslot.
> > Removing "no hit" rules, "nopublish" bloat, etc. would probably shorten
> > masscheck time.
> 
> 
> > (I've been away for a while.  Is SOUGHT still updating?)
> SOUGHT_FRAUD is ok - generic SOUGHT seems faulty, atm
> JM has been contacted.

This *is* an sa-update channel, which is why it's marked nopublish in
subversion.  Its presence here merely helps people see that it is vetted.
Because these only run on one header, their performance should be quite fast. 
They also use constructed (optimized) regular expressions for performance
purposes (as indicated in earlier posts on this very bug).

I set this up in this manner because SA rule publication wasn't fast enough to
make use of updates at a feasible cadence.  Is this still true?  If we're
confident that SA can update fast (and reliably!) enough, I can remove the
nopublish tflag.
