You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2012/10/24 16:56:03 UTC

[Bug 6855] New: Rules with extremely poor CPU_time / mail_size ratio

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

          Priority: P2
            Bug ID: 6855
          Assignee: dev@spamassassin.apache.org
           Summary: Rules with extremely poor CPU_time / mail_size ratio
          Severity: normal
    Classification: Unclassified
                OS: All
          Reporter: Mark.Martinec@ijs.si
          Hardware: All
            Status: NEW
           Version: 3.4 SVN branch
         Component: Rules
           Product: Spamassassin

Here is another problem report referring to long SpamAssassin scan times
with larger mail. Unlike Bug 6854, which concentrates on code hotspots,
this one refers to problematic rules (and/or their implementation).
There may be other similar PRs elsewhere.

Attached is a png image of a plot. The scattergram shows dots for
each mail message being processed on our system for the last two days.
The x-axis is a mail size in bytes, the y-axis is a CPU usage time
in milliseconds used only by SpamAssassin, as reported by a getrusage(2)
system call (via Unix-Getrusage-0.03 perl module).

Initialization and startup times are excluded from the measurements,
and so are mail transfer times. Measuring the CPU time instead of
elapsed time also largely removes the effect of network rules.

What caught my eye is that the ratio of CPU time vs. mail size
does not fall on a narrow line, but forms a rather wide wedge.
While most samples fall on the lower side of the wedge (showing
a slope of about 355 kB/s with a fixed offset of about 0.4 s),
some samples fall near the upper side of the wedge, showing
an extremely wasteful linear slope of about 11.5 kB/s.
These samples are shown in blue on the diagram.

I managed to get hold of a couple of such extreme messages
which fell near the upper edge of the wedge. Luckily some
were addressed to me - like a SA cron job as sent to
"nitemc@spamassassin2.zones.apache.org" and a
"bad sandbox rules report". Similar were a couple of in-house
cron jobs or rsync reports. But not all extreme samples were
the result of automatic runs, there were also legitimate
googlegroups chats with long thread citations and some reports
to authors from a publishing company. Will attach two or three
of these.

Note these were not extremely large messages, here are sample
sizes of four of such cases: 187kB, 154kB, 73kB, 93kB.
Profiling these four with a NYTProf profiler in a command-line
spamassassin run shows similar results as can be obtained
through a HitFreqsRuleTiming.pm plugin.

Will attach the results. Here is a summary - a top few of
the most "CPU resource usage per mail size" wasteful rules
for each of these four (quite different) messages:

785ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
777ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
299ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
216ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
128ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
126ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
59.5ms 72_active.cf, rule __LOTSA_MONEY_03,

645ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
641ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
179ms  72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
145ms  72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
141ms  72_active.cf, rule __FILL_THIS_FORM_LONG1, 
110ms  72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
48.6ms 72_active.cf, rule __LOTSA_MONEY_03, 

219ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
212ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
84.2ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.6ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
28.6ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
27.7ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
15.7ms 72_active.cf, rule __LOTSA_MONEY_03, 

236ms  72_active.cf, rule __FILL_THIS_FORM_SHORT2, 
225ms  72_active.cf, rule __FILL_THIS_FORM_LONG2, 
89.9ms 72_active.cf, rule __FILL_THIS_FORM_FRAUD_PHISH1, 
62.9ms 72_active.cf, rule __FILL_THIS_FORM_LOAN1, 
34.9ms 72_active.cf, rule __FILL_THIS_FORM_LONG1, 
34.5ms 72_active.cf, rule __FILL_THIS_FORM_SHORT1, 
18.8ms 72_active.cf, rule T_LOTTO_URI, 
18.4ms 72_active.cf, rule __GAPPY_HTML_02, 
16.8ms 72_active.cf, rule __LOTSA_MONEY_03, 
15.9ms 72_active.cf, rule __LOTSA_MONEY_01, 

I'm not sure what can/should be done about it,
but people tell me it is not acceptable that it takes
30 seconds of intensive CPU spinning to process a 350 kB message
(on our AMD Opteron, not the very latest and greatest,
but still a decent hardware).

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #5 from AXB <ax...@gmail.com> ---
Assuming you tested this with Amavisd, could you run the same test using
spamc/spamd and the same ruleset?

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #6 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5107
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5107&action=edit
Profiling report - CPU time per rule, as reported by NYTProf for four sample
messages

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #9 from Mark Martinec <Ma...@ijs.si> ---
> > could you run the same test using spamc/spamd and the same ruleset?

Forgot to add: the ruleset in use what what came out freshly
from sa-update for 3.4.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

John Hardin <jh...@impsec.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhardin@impsec.org

--- Comment #11 from John Hardin <jh...@impsec.org> ---
(In reply to comment #4)
> Created attachment 5106 [details]
> Mail sample 3

Thanks, I will take a look at those as soon as I can and see if I can improve
the rules.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #7 from Mark Martinec <Ma...@ijs.si> ---
> These samples are shown in blue on the diagram.

In red, sorry. The rest of the samples is in green.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #10 from Kevin A. McGrail <km...@pccc.com> ---
This is just great data.  And I say that because none of my rules were flagged
;-)

Can we add this type of profiling to a test procedure for rules that is added
to rulesqa website?  Something to consider with S/O as a ratio of S/O to
resources required?

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5103
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5103&action=edit
The diagram: CPU usage vs. mail size

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5106
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5106&action=edit
Mail sample 3

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #8 from Mark Martinec <Ma...@ijs.si> ---
> Assuming you tested this with Amavisd,

Indeed. The dot-samples in the diagram were taken from its log.

> could you run the same test using spamc/spamd and the same ruleset?

The samples in the red category which I could get hold of
were re-measured in a command-line spamassassin (ignoring the
startup times), with similar results. I believe this should suffice.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #3 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5105
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5105&action=edit
Mail sample 2

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6855] Rules with extremely poor CPU_time / mail_size ratio

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6855

--- Comment #2 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5104
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5104&action=edit
Mail sample 1

-- 
You are receiving this mail because:
You are the assignee for the bug.