Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/03/31 17:09:18 UTC

[Bug 4849] New: [PATCH] Reporting URLs that triggered rules

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4849

           Summary: [PATCH] Reporting URLs that triggered rules
           Product: Spamassassin
           Version: 2.63
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: spamassassin
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: nikclayton@gmail.com


When a URI matching rule fires, I need to know which URI in the message
triggered the match.

In theory, this is simple -- just have multiple URI rules, log the rule
name that fired, and then go from the rule name back to the URI.

In practice, this fails to scale. I have ~9000 URI matching rules.
With them loaded, Perl + SA takes up ~75MB of RSS, and scanning is
significantly slower.

If I use Regexp::Assemble to turn these URI regexps into one great big
(70KB!) regexp, memory usage drops to around 36MB, and scanning time is
halved.
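
A rough sketch of the single-rule idea, for illustration only. It uses a
plain alternation built with join() rather than Regexp::Assemble (which
also merges common prefixes), so that it runs with core Perl alone. The
patterns and URIs are made-up placeholders, not my real rules.

```perl
use strict;
use warnings;

# Hypothetical sample patterns; the real rule set has ~9000 of these.
my @patterns = (
    qr{example-pills\.test}i,
    qr{cheap-watches\.test/buy}i,
    qr{casino-[a-z]+\.test}i,
);

# Naive combination: one big alternation of all the patterns.
# (Regexp::Assemble would produce a more compact regexp than this.)
my $alternation = join '|', map { "(?:$_)" } @patterns;
my $combined    = qr/$alternation/;

# Hypothetical URIs extracted from a message.
my @uris = (
    'http://www.example.org/about',
    'http://casino-royale.test/play',
    'http://cheap-watches.test/buy?id=1',
);

# With one combined rule, the rule name no longer identifies the
# pattern, so the matching URI itself is what has to be recorded.
my @matched = grep { $_ =~ $combined } @uris;
print "$_\n" for @matched;
```

The point of the sketch is the last step: once everything is one rule,
the only useful thing left to log is the URI that matched.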

But if I do this, there's only one URI rule. So when management asks "Why
was this message blocked?" I can no longer give them the details they've
grown accustomed to hearing (i.e., which URL triggered the block). It
also makes it difficult to track down cases where a URL block is
overzealous and is blocking legitimate messages.

So... here is a patch that stores each URI that a rule matches, and
provides a get_names_of_uris_hit() method to return this information.
I can then use this to log the URLs that triggered my single URI
matching rule. Best of both worlds.
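
To show the intended use without needing a patched SpamAssassin, here is
a small stand-in. Sketch::Status is a hypothetical class that copies only
the bookkeeping from the patch; it is not Mail::SpamAssassin::PerMsgStatus.
The rule name and URIs are made up, and I am assuming that got_hit() marks
the rule as already hit, which the sketch imitates directly.

```perl
use strict;
use warnings;

package Sketch::Status;    # hypothetical stand-in for PerMsgStatus

sub new {
    my ($class) = @_;
    return bless { uris_hit => [], tests_already_hit => {} }, $class;
}

sub got_uri_pattern_hit {
    my ($self, $rulename, $uri) = @_;

    # only allow each test to hit once per mail, as in the patch
    return if defined $self->{tests_already_hit}->{$rulename};

    push @{ $self->{uris_hit} }, $uri;

    # Assumption: the real got_hit() records the rule as already hit.
    $self->{tests_already_hit}->{$rulename} = 1;
}

sub get_names_of_uris_hit {
    my ($self) = @_;
    return @{ $self->{uris_hit} };
}

package main;

my $status = Sketch::Status->new;
my $rule   = qr{blocked-(?:pills|watches)\.test}i;    # made-up combined rule

for my $uri (qw(http://ok.test/
                http://blocked-pills.test/a
                http://blocked-watches.test/b)) {
    $status->got_uri_pattern_hit('BIG_URI_RULE', $uri) if $uri =~ $rule;
}

my @hits = $status->get_names_of_uris_hit;
print "Blocked by URI: $_\n" for @hits;
```

Under that assumption, only the first matching URI per rule is recorded,
because of the once-per-mail check. That is enough to answer "why was
this blocked?", but it will not list every offending URL in a message.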

The patch is against 2.63, I'm afraid, since that's what I have handy.
Forward porting to 3.1.1 should be trivial.

Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
===================================================================
--- lib/Mail/SpamAssassin/PerMsgStatus.pm (revision 14687)
+++ lib/Mail/SpamAssassin/PerMsgStatus.pm (working copy)
@@ -68,6 +68,7 @@
'test_logs' => '',
'test_names_hit' => [ ],
'subtest_names_hit' => [ ],
+ 'uris_hit' => [ ],
'tests_already_hit' => { },
'hdr_cache' => { },
'rule_errors' => 0,
@@ -364,6 +365,21 @@

###########################################################################

+=item @list = $status->get_names_of_uris_hit ()
+
+After a mail message has been checked, this method can be called. It will
+return a list of all the URIs that were hit by rules.
+
+=cut
+
+sub get_names_of_uris_hit {
+ my ($self) = @_;
+
+ return @{$self->{uris_hit}};
+}
+
+###########################################################################
+
=item $list = $status->get_names_of_subtests_hit ()

After a mail message has been checked, this method can be called. It will
@@ -1836,7 +1852,7 @@
foreach ( @_ ) {
'.$self->hash_line_for_rule($rulename).'
if ('.$pat.') {
- $self->got_uri_pattern_hit (q{'.$rulename.'});
+ $self->got_uri_pattern_hit (q{'.$rulename.'}, $_);
'. $self->ran_rule_debug_code ($rulename,"uri test", 4) . '
}
}
@@ -2315,12 +2331,13 @@
}

sub got_uri_pattern_hit {
- my ($self, $rulename) = @_;
+ my ($self, $rulename, $uri) = @_;

# only allow each test to hit once per mail
# TODO: Move this into the rule matcher
return if (defined $self->{tests_already_hit}->{$rulename});

+ push @{$self->{uris_hit}}, $uri;
$self->got_hit ($rulename, 'URI: ');
}


