Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2012/10/23 21:11:41 UTC

[Bug 6854] New: Optimizations, profiling

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854

          Priority: P2
            Bug ID: 6854
          Assignee: dev@spamassassin.apache.org
           Summary: Optimizations, profiling
          Severity: enhancement
    Classification: Unclassified
                OS: All
          Reporter: Mark.Martinec@ijs.si
          Hardware: All
            Status: NEW
           Version: 3.4 SVN branch
         Component: Libraries
           Product: Spamassassin

Created attachment 5102
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5102&action=edit
The low-hanging fruit

Spent a day with the NYTProf 4.08 Perl profiler trying to cut down
some of the inefficiencies of SpamAssassin when dealing with large mail
messages (which are usually large thanks to some Base64-encoded
attachments). Using Perl 5.16 on a FreeBSD 9.1 platform.

Picking just the low-hanging fruit with the most outstanding hotspots in
each iteration, I managed to shave off about 100 ms of CPU-intensive
hotspots (local tests only) in a command-line spamassassin run
(with a 3 MB message containing a large PDF).

Depending on what is being measured (like aggregate mail throughput),
and what proportion of large messages is being passed to SpamAssassin
(like passing only the first 420 kB from amavisd to SpamAssassin),
this amounts to a speedup of between 3 and 7 % for large messages.
Not too bad where every bit adds up.
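
For a rough sense of scale, the figures above can be turned into an implied
per-message CPU time. This is only back-of-the-envelope arithmetic in Python;
it assumes the 100 ms saving and the 3 to 7 % range describe the same run,
which the numbers above do not guarantee. The function name is made up for
the illustration.

```python
# Back-of-the-envelope: what total CPU time per message would make a
# 100 ms saving equal to a 3 % or a 7 % speedup?
# Assumption: the saving and the percentages describe the same workload.
SAVED_MS = 100.0


def implied_baseline_ms(saved_ms: float, speedup_fraction: float) -> float:
    """Baseline time such that saved_ms / baseline == speedup_fraction."""
    return saved_ms / speedup_fraction


for pct in (3, 7):
    baseline = implied_baseline_ms(SAVED_MS, pct / 100)
    print(f"{pct} % speedup -> baseline of about {baseline / 1000:.1f} s per message")
```

Under that assumption the profiled runs would take somewhere between roughly
one and a half and three and a half seconds per large message.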

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6854] Optimizations, profiling

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|Undefined                   |3.4.0

--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
Ok, that's it for now, more profiling some time in the future...


Re: [Bug 6854] Optimizations, profiling

Posted by Mark Martinec <Ma...@ijs.si>.

Axb,

> Such a sample doesn't convince me (yet) as it doesn't show potential FPs
> due to scans on raw encoded attachments after 4 lines of txt/html as well
> as timing per body rule type.
> Could you let me have this sample corpus to compare results with
> spamc/spamd under different conditions?

(answered offline)

  Mark

Re: [Bug 6854] Optimizations, profiling

Posted by Axb <ax...@gmail.com>.

On 10/24/2012 02:34 AM, Mark Martinec wrote:
> On Tuesday October 23 2012 22:26:00 Axb wrote:
>> Spamc/Spamd's "skip size" method  has made a huge *positive* difference
>> on FPs, and scan times.
>> The FNs wouldn't *ever* have been caught by a chunk method due to the
>> kind of content included "above" threshold.
>
> Out of curiosity, during the last 10 days our system detected
> almost 200 large spam messages (manually confirmed spam) with
> size above 400 kB (of which SpamAssassin saw only the first
> 420 kB, the rest was truncated).
>
> Of these there were 55 distinct species:
>   17 in the  400..500 kB region
>   16 in the  500..700 kB region
>    9 in the  700..1000 kB region
>   10 in the 1000..2000 kB region
>    2 of 2.8 MB
>    1 of 3.6 MB
>
> Median spam score (by species) for these was Q2=15.5, with
> quartile scores Q1=11 and Q3=27, so I'd say SpamAssassin did
> a good job with these. The most valuable score contributions
> seem to have come from the mail header section (subject, RBL, bayes);
> attachment contents were probably less important.

SA's default is 512kB, right? Many ppl raise that to close to 1MB.
After that, how much of your checked corpus would have survived RBL 
rejects at MTA level?

Such a sample doesn't convince me (yet) as it doesn't show potential FPs 
due to scans on raw encoded attachments after 4 lines of txt/html as well 
as timing per body rule type.

Could you let me have this sample corpus to compare results with 
spamc/spamd under different conditions?

Axb

Re: [Bug 6854] Optimizations, profiling

Posted by Mark Martinec <Ma...@ijs.si>.

On Tuesday October 23 2012 22:26:00 Axb wrote:
> Spamc/Spamd's "skip size" method  has made a huge *positive* difference
> on FPs, and scan times.
> The FNs wouldn't *ever* have been caught by a chunk method due to the
> kind of content included "above" threshold.

Out of curiosity, during the last 10 days our system detected
almost 200 large spam messages (manually confirmed spam) with
size above 400 kB (of which SpamAssassin saw only the first
420 kB, the rest was truncated).

Of these there were 55 distinct species:
 17 in the  400..500 kB region
 16 in the  500..700 kB region
  9 in the  700..1000 kB region
 10 in the 1000..2000 kB region
  2 of 2.8 MB
  1 of 3.6 MB

Median spam score (by species) for these was Q2=15.5, with
quartile scores Q1=11 and Q3=27, so I'd say SpamAssassin did
a good job with these. The most valuable score contributions
seem to have come from the mail header section (subject, RBL, bayes);
attachment contents were probably less important.
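
As a quick cross-check of the breakdown above, a few lines of Python confirm
that the size buckets add up to the 55 species and show each bucket's share.
Only the counts are taken from the list; the per-species scores are not
reproduced here, so the quartiles are not recomputed.

```python
# Size buckets and species counts copied from the list above.
species_by_size = {
    "400..500 kB": 17,
    "500..700 kB": 16,
    "700..1000 kB": 9,
    "1000..2000 kB": 10,
    "2.8 MB": 2,
    "3.6 MB": 1,
}

total = sum(species_by_size.values())
for bucket, count in species_by_size.items():
    # Share of all distinct species that fall into this size bucket.
    print(f"{bucket:>14}: {count:2d} ({count / total:.0%})")
print(f"{'total':>14}: {total}")
```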

  Mark

Re: [Bug 6854] Optimizations, profiling

Posted by Axb <ax...@gmail.com>.

On 10/23/2012 10:15 PM, Kevin A. McGrail wrote:
 > On 10/23/2012 4:10 PM, Axb wrote:
 >> On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
 >>> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
 >>>> A message larger than a certain configured size is truncated
 >>>> at the configured size and that is what SpamAssassin sees.
 >>>> No other contents processing in this data path, just
 >>>> blunt truncation of the raw mail message. Works quite well,
 >>>> certainly much better than not scanning large messages at all.
 >>> Makes sense to me.  Something we should consider for SA to do by default
 >>> with spamc/spamd?
 >>
 >> why? wasn't spamassassin designed to ignore attachments instead of
 >> what MailScanner and Amavisd are doing using the API?
 >>
 >> Why should SA spend time scanning "binary" content it cannot decode or
 >> extract anything useful to apply rules?
 >>
 >> I consider the "chunk" method the worst way to do it as it may skip
 >> txt/html content which could show up after the configured chunk size
 >> while spending lots of cycles scanning a two-liner with an attached
 >> 400kb PDF/Word/etc attachment.
 >> or did I get it all wrong?
 > It's very synergistic because I wrote a note about this 2 days ago.
 >
 > With SpamC/SpamD, if the email (encoded) is larger than X size, the
 > email is not scanned at all.

make it switchable?

scan_chunk  yes
scan_chunk_size 400kb

but as a global method change, no thanks.
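
To make the suggestion concrete, here is a small Python sketch of how such a
switch might behave. The two settings above are only a proposal, not existing
SpamAssassin options, and `parse_size` and `message_for_scan` are hypothetical
helpers. The sketch also assumes that "kb" means 1024 bytes.

```python
import re
from typing import Optional

# Assumption: binary units, so "400kb" means 400 * 1024 bytes.
_UNITS = {"": 1, "b": 1, "kb": 1024, "mb": 1024 * 1024}


def parse_size(text: str) -> int:
    """Parse values such as '400kb' or '1 MB' into a byte count."""
    m = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", text)
    if not m or m.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2).lower()]


def message_for_scan(raw: bytes, chunking: bool, limit: int) -> Optional[bytes]:
    """Return the bytes to scan, or None when the message is skipped.

    chunking=False models the "skip size" behaviour, chunking=True the
    blunt truncation at `limit` bytes.
    """
    if len(raw) <= limit:
        return raw
    return raw[:limit] if chunking else None


limit = parse_size("400kb")
big = b"x" * (limit + 1)
print(limit)
print(message_for_scan(big, chunking=False, limit=limit) is None)
print(len(message_for_scan(big, chunking=True, limit=limit)))
```

Keeping the skip behaviour as the default and chunking as an opt-in matches
the "switchable, but not a global change" request.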

 > My thoughts were to ignore any binary attachments.  I even considered
 > writing a glue that would rewrite a temporary copy of the email to
 > remove binary attachments and see if THAT met the threshold.  But if
 > real-world experience with simply chopping works well, who am I to
 > complain?

I stopped using MailScanner for exactly that reason: lots of plugins 
and rules go nuts over encoded attachments.

Spamc/Spamd's "skip size" method  has made a huge *positive* difference 
on FPs, and scan times.

The FNs wouldn't *ever* have been caught by a chunk method due to the 
kind of content included "above" threshold.

Re: [Bug 6854] Optimizations, profiling

Posted by Axb <ax...@gmail.com>.

On 10/23/2012 11:29 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Axb wrote:
>
>> On 10/23/2012 10:48 PM, John Hardin wrote:
>>>  On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>>>
>>> >  My thoughts were to ignore any binary attachments.
>>>
>>>  I don't think that's justified. I'm beginning to see a resurgence of
>>>  image spams that the OCR plugin would probably catch. Plus I fairly
>>>  regularly see 419 spams with the body of the pitch in a PDF or MS Word
>>>  document attachment.
>>
>> SA never scanned binary attachments and the chunk method wouldn't
>> change that, just apply rules to content for which they were not
>> designed.
>>
>> PDF/Word attachments need to be detected by checksum or other newer
>> methods, but definitely not by the existing rule methods.
>> You won't get anything useful with a raw/body rule or any other regex
>> scanner out of an encoded chunk of an attachment.
>
> I'm not suggesting you would.
>
>> Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to
>> do foo against attachments.
>
> That's my point. If we strip binary attachments, what would PDFinfo,
> Imageinfo, FuzzyOCR et al. have to work with?
>
> Or am I misunderstanding and this stripping is occurring internally to
> SA and affects what the RE rules scan? If so, I apologize, I was
> assuming the context was spamc or something else client-side doing the
> strip/ignore and SA never getting the attachments in the first place...

iirc, SA gets the attachments, just doesn't run rules against them, yet 
lets plugins handle the attachments.
This allows stuff like attachment hashers, OCR scanners, etc. to handle 
the attachments.

The raw chunk method can also break this if SA only sees part of the 
attachment due to a configured chunk limit. (been there)
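
A small Python illustration of why this matters for attachment hashers: once
the payload is cut short, its digest no longer matches the digest of the
complete file, so a checksum list cannot recognise it. The payload, the
400 kB cut point, and the choice of SHA-256 are all made up for the example.

```python
import hashlib

# Stand-in for a decoded attachment; real data would be a PDF or image.
attachment = bytes(range(256)) * 4096   # 1 MiB of sample data
chunk_limit = 400 * 1024                # hypothetical chunk size in bytes

# Digest of the complete file versus digest of what a chunked scan sees.
full_digest = hashlib.sha256(attachment).hexdigest()
cut_digest = hashlib.sha256(attachment[:chunk_limit]).hexdigest()

print("full :", full_digest[:16])
print("cut  :", cut_digest[:16])
print("match:", full_digest == cut_digest)
```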

Re: [Bug 6854] Optimizations, profiling

Posted by John Hardin <jh...@impsec.org>.

On Tue, 23 Oct 2012, Axb wrote:

> On 10/23/2012 10:48 PM, John Hardin wrote:
>>  On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>> 
>> >  My thoughts were to ignore any binary attachments.
>>
>>  I don't think that's justified. I'm beginning to see a resurgence of
>>  image spams that the OCR plugin would probably catch. Plus I fairly
>>  regularly see 419 spams with the body of the pitch in a PDF or MS Word
>>  document attachment.
>
> SA never scanned binary attachments and the chunk method wouldn't change 
> that, just apply rules to content for which they were not designed.
>
> PDF/Word attachments need to be detected by checksum or other newer methods, 
> but definitely not by the existing rule methods.
> You won't get anything useful with a raw/body rule or any other regex scanner 
> out of an encoded chunk of an attachment.

I'm not suggesting you would.

> Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to do foo 
> against attachments.

That's my point. If we strip binary attachments, what would PDFinfo, 
Imageinfo, FuzzyOCR et al. have to work with?

Or am I misunderstanding and this stripping is occurring internally to SA 
and affects what the RE rules scan? If so, I apologize, I was assuming the 
context was spamc or something else client-side doing the strip/ignore and 
SA never getting the attachments in the first place...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   One death is a tragedy; thirty is a media sensation;
   a million is a statistic.              -- Joseph Stalin, modernized
-----------------------------------------------------------------------
  145 days since the first successful private support mission to ISS (SpaceX)

Re: [Bug 6854] Optimizations, profiling

Posted by Henrik Krohns <he...@hege.li>.

On Tue, Oct 23, 2012 at 11:02:37PM +0200, Axb wrote:
> On 10/23/2012 10:48 PM, John Hardin wrote:
> >On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
> >
> >>My thoughts were to ignore any binary attachments.
> >
> >I don't think that's justified. I'm beginning to see a resurgence of
> >image spams that the OCR plugin would probably catch. Plus I fairly
> >regularly see 419 spams with the body of the pitch in a PDF or MS Word
> >document attachment.
> 
> SA never scanned binary attachments and the chunk method wouldn't
> change that, just apply rules to content for which they were not
> designed.

Just as a reminder for everyone:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582

The problem here is SA stumbling onto large masses of data that it believes
to be "text", thus running all body rules etc on that.  If we fix or limit
the impact of that, there's no reason to have any kind of silly "skip or
chop large message" kludges.

You'd only want to skip large messages completely if you are very low on
resources and can't spare CPU or a few megs of memory to keep the parsed
message blobs in memory.  Ok, there probably isn't much spam in the 10MB
range anymore, so you might skip that.

Re: [Bug 6854] Optimizations, profiling

Posted by Axb <ax...@gmail.com>.

On 10/23/2012 10:48 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>
>> My thoughts were to ignore any binary attachments.
>
> I don't think that's justified. I'm beginning to see a resurgence of
> image spams that the OCR plugin would probably catch. Plus I fairly
> regularly see 419 spams with the body of the pitch in a PDF or MS Word
> document attachment.

SA never scanned binary attachments and the chunk method wouldn't 
change that, just apply rules to content for which they were not designed.

PDF/Word attachments need to be detected by checksum or other newer 
methods, but definitely not by the existing rule methods.
You won't get anything useful with a raw/body rule or any other regex 
scanner out of an encoded chunk of an attachment.

Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to do 
foo against attachments.

Re: [Bug 6854] Optimizations, profiling

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 10/23/2012 4:48 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>
>> My thoughts were to ignore any binary attachments.
>
> I don't think that's justified. I'm beginning to see a resurgence of 
> image spams that the OCR plugin would probably catch. Plus I fairly 
> regularly see 419 spams with the body of the pitch in a PDF or MS Word 
> document attachment.

Sorry, that's not QUITE what I meant.  If an email is over the limit, 
ignore the binary attachments.

Re: [Bug 6854] Optimizations, profiling

Posted by John Hardin <jh...@impsec.org>.

On Tue, 23 Oct 2012, Kevin A. McGrail wrote:

> My thoughts were to ignore any binary attachments.

I don't think that's justified. I'm beginning to see a resurgence of image 
spams that the OCR plugin would probably catch. Plus I fairly regularly 
see 419 spams with the body of the pitch in a PDF or MS Word document 
attachment.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   My sidearm is a piece of emergency equipment. It absolutely must
   be reliable, not "smart".
-----------------------------------------------------------------------
  145 days since the first successful private support mission to ISS (SpaceX)

Re: [Bug 6854] Optimizations, profiling

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 10/23/2012 4:10 PM, Axb wrote:
> On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
>> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
>>> A message larger than a certain configured size is truncated
>>> at the configured size and that is what SpamAssassin sees.
>>> No other contents processing in this data path, just
>>> blunt truncation of the raw mail message. Works quite well,
>>> certainly much better than not scanning large messages at all.
>> Makes sense to me.  Something we should consider for SA to do by default
>> with spamc/spamd?
>
> why? wasn't spamassassin designed to ignore attachments instead of 
> what MailScanner and Amavisd are doing using the API?
>
> Why should SA spend time scanning "binary" content it cannot decode or 
> extract anything useful to apply rules?
>
> I consider the "chunk" method the worst way to do it as it may skip 
> txt/html content which could show up after the configured chunk size 
> while spending lots of cycles scanning a two-liner with an attached 
> 400kb PDF/Word/etc attachment.
> or did I get it all wrong?
It's very synergistic because I wrote a note about this 2 days ago.

With SpamC/SpamD, if the email (encoded) is larger than X size, the 
email is not scanned at all.

My thoughts were to ignore any binary attachments.  I even considered 
writing a glue that would rewrite a temporary copy of the email to 
remove binary attachments and see if THAT met the threshold.  But if 
real-world experience with simply chopping works well, who am I to complain?
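
The glue idea could look roughly like the following Python sketch, which uses
the standard `email` package rather than anything from SpamAssassin. It
measures what would remain instead of actually writing a temporary copy. The
function names, the 500 kB threshold, and the choice to treat every
non-`text/*` leaf part as binary are all assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

THRESHOLD = 500 * 1024  # hypothetical size limit in bytes


def text_only_size(raw: bytes) -> int:
    """Sum the encoded size of all text/* leaf parts, ignoring the rest."""
    msg = message_from_bytes(raw, policy=policy.default)
    size = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_maintype() == "text":
            size += len(part.as_bytes())
    return size


def should_scan(raw: bytes, threshold: int = THRESHOLD) -> bool:
    """Scan if the message fits, or fits once binary parts are ignored."""
    return len(raw) <= threshold or text_only_size(raw) <= threshold


# Build a two-line message with a large fake PDF attached.
msg = EmailMessage()
msg["Subject"] = "invoice"
msg.set_content("See attached.\nThanks.\n")
msg.add_attachment(b"\0" * (1024 * 1024), maintype="application",
                   subtype="pdf", filename="big.pdf")
raw = msg.as_bytes()

print(len(raw) > THRESHOLD, should_scan(raw))
```

A plain size check would skip this message entirely, while the text-only
measure lets it through for scanning.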

regards,
KAM

Re: [Bug 6854] Optimizations, profiling

Posted by Axb <ax...@gmail.com>.

On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
>> A message larger than a certain configured size is truncated
>> at the configured size and that is what SpamAssassin sees.
>> No other contents processing in this data path, just
>> blunt truncation of the raw mail message. Works quite well,
>> certainly much better than not scanning large messages at all.
> Makes sense to me.  Something we should consider for SA to do by default
> with spamc/spamd?

why? wasn't spamassassin designed to ignore attachments instead of what 
MailScanner and Amavisd are doing using the API?

Why should SA spend time scanning "binary" content it cannot decode or 
extract anything useful to apply rules?

I consider the "chunk" method the worst way to do it as it may skip 
txt/html content which could show up after the configured chunk size 
while spending lots of cycles scanning a two-liner with an attached 
400kb PDF/Word/etc attachment.
or did I get it all wrong?
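
The concern about late txt/html content can be shown with a few lines of
Python: if the text part follows a large attachment, a raw cut at the chunk
size never reaches it. The message layout, the marker string, and the 400 kB
limit below are invented for the example.

```python
# Hypothetical raw message: large encoded attachment first, text part last.
CHUNK = 400 * 1024  # assumed chunk size in bytes

headers = b"Subject: test\r\nContent-Type: multipart/mixed; boundary=B\r\n\r\n"
attachment_part = (
    b"--B\r\nContent-Type: application/pdf\r\n"
    b"Content-Transfer-Encoding: base64\r\n\r\n"
    + b"QUJD" * (150 * 1024) + b"\r\n"   # about 600 kB of Base64 filler
)
text_part = (
    b"--B\r\nContent-Type: text/plain\r\n\r\n"
    b"BUY NOW cheap pills\r\n--B--\r\n"
)

raw = headers + attachment_part + text_part
seen_by_scanner = raw[:CHUNK]   # what a blunt truncation passes on

print("text part in full message:", b"BUY NOW" in raw)
print("text part in chunk       :", b"BUY NOW" in seen_by_scanner)
```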

Axb

Re: [Bug 6854] Optimizations, profiling

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
> A message larger than a certain configured size is truncated
> at the configured size and that is what SpamAssassin sees.
> No other contents processing in this data path, just
> blunt truncation of the raw mail message. Works quite well,
> certainly much better than not scanning large messages at all.
Makes sense to me.  Something we should consider for SA to do by default 
with spamc/spamd?

regards,
KAM

[Bug 6854] Optimizations, profiling

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854

--- Comment #3 from Mark Martinec <Ma...@ijs.si> ---
> I'm not sure I understand:
> Does Amavisd send chunks of raw message to SA instead of only the txt/html
> parts and leave "attachments" unscanned?

A message larger than a certain configured size is truncated
at the configured size and that is what SpamAssassin sees.
No other contents processing in this data path, just
blunt truncation of the raw mail message. Works quite well,
certainly much better than not scanning large messages at all.
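
In code terms the data path amounts to little more than a slice. A minimal
Python sketch follows; `truncate_for_scan` and the constant are illustrative
names, "420 kB" is taken as 420 * 1024 bytes, and none of this is amavisd's
actual (Perl) code.

```python
# Illustrative only: not amavisd's implementation.
# Assumption: "420 kB" is taken as 420 * 1024 bytes.
MAX_SCAN_BYTES = 420 * 1024


def truncate_for_scan(raw_message: bytes, limit: int = MAX_SCAN_BYTES) -> bytes:
    """Return at most `limit` leading bytes of the raw message.

    No MIME parsing happens here, so a Base64 attachment may be cut mid-line.
    """
    return raw_message[:limit]


small = b"Subject: hi\r\n\r\nshort body\r\n"
large = b"Subject: big\r\n\r\n" + b"A" * (3 * 1024 * 1024)

print(len(truncate_for_scan(small)))  # unchanged, below the limit
print(len(truncate_for_scan(large)))  # capped at MAX_SCAN_BYTES
```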


[Bug 6854] Optimizations, profiling

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854

--- Comment #2 from AXB <ax...@gmail.com> ---
(In reply to comment #0)
> Created attachment 5102 [details]
> The low-hanging fruit
> 
> Spent a day with the NYTProf 4.08 Perl profiler trying to cut down
> some of the inefficiencies of SpamAssassin when dealing with large mail
> messages (which are usually large thanks to some Base64-encoded
> attachments). Using Perl 5.16 on a FreeBSD 9.1 platform.
> 
> Picking just the low-hanging fruit with the most outstanding hotspots in
> each iteration, I managed to shave off about 100 ms of CPU-intensive
> hotspots (local tests only) in a command-line spamassassin run
> (with a 3 MB message containing a large PDF).
> 
> Depending on what is being measured (like aggregate mail throughput),
> and what proportion of large messages is being passed to SpamAssassin
> (like passing only the first 420 kB from amavisd to SpamAssassin),
> this amounts to a speedup of between 3 and 7 % for large messages.
> Not too bad where every bit adds up.

I'm not sure I understand:
Does Amavisd send chunks of raw message to SA instead of only the txt/html
parts and leave "attachments" unscanned?


[Bug 6854] Optimizations, profiling

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854

--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
trunk:
$ svn ci -m 'Bug 6854: Optimizations, profiling'      
  Sending lib/Mail/SpamAssassin/Conf/Parser.pm
  Sending lib/Mail/SpamAssassin/Message.pm
  Sending lib/Mail/SpamAssassin/Plugin/MIMEEval.pm
  Sending lib/Mail/SpamAssassin/Plugin/VBounce.pm
  Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1401393.
