Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2012/10/23 21:11:41 UTC
[Bug 6854] New: Optimizations, profiling
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854
Priority: P2
Bug ID: 6854
Assignee: dev@spamassassin.apache.org
Summary: Optimizations, profiling
Severity: enhancement
Classification: Unclassified
OS: All
Reporter: Mark.Martinec@ijs.si
Hardware: All
Status: NEW
Version: 3.4 SVN branch
Component: Libraries
Product: Spamassassin
Created attachment 5102
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5102&action=edit
The low-hanging fruit
Spent a day with a NYTProf 4.08 Perl profiler trying to cut down
some of the inefficiencies of SpamAssassin dealing with large mail
messages (which are usually large thanks to some Base64-encoded
attachments). Using Perl 5.16 on a FreeBSD 9.1 platform.
Picking just the low-hanging fruit with most outstanding hotspots in
each iteration, I managed to shave off about 100 ms of CPU-intensive
hotspots (local tests only) in a command-line spamassassin run
(with a 3 MB message containing a large PDF).
Depending on what is being measured (like aggregate mail throughput),
and what proportion of large messages are being passed to SpamAssassin
(like passing only the first 420 kB from amavisd to SpamAssassin),
this amounts to between 3 and 7 % of a speedup for large messages.
Not too bad where every bit adds up.
--
You are receiving this mail because:
You are the assignee for the bug.
[Bug 6854] Optimizations, profiling
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854
Mark Martinec <Ma...@ijs.si> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |FIXED
Target Milestone|Undefined |3.4.0
--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
Ok, that's it for now, more profiling some time in the future...
Re: [Bug 6854] Optimizations, profiling
Posted by Mark Martinec <Ma...@ijs.si>.
Axb,
> Such a sample doesn't convince me (yet) as it doesn't show potential FPs
> due to scans on raw encoded attachments after 4 lines of txt/html as well
> as timing per body rule type.
> Could you let me have this sample corpus to compare results with
> spamc/spamd under different conditions?
(answered offline)
Mark
Re: [Bug 6854] Optimizations, profiling
Posted by Axb <ax...@gmail.com>.
On 10/24/2012 02:34 AM, Mark Martinec wrote:
> On Tuesday October 23 2012 22:26:00 Axb wrote:
>> Spamc/Spamd's "skip size" method has made a huge *positive* difference
>> on FPs, and scan times.
>> The FNs wouldn't *ever* have been caught by a chunk method due to the
>> kind of content included "above" threshold.
>
> Out of curiosity, during the last 10 days our system detected
> almost 200 large spam messages (manually confirmed spam) with
> size above 400 kB (of which SpamAssassin saw only the first
> 420 kB, the rest was truncated).
>
> Of these there were 55 distinct species:
> 17 in the 400..500 kB region
> 16 in the 500..700 kB region
> 9 in the 700..1000 kB region
> 10 in the 1000..2000 kB region
> 2 of 2.8 MB
> 1 of 3.6 MB
>
> Median spam score (by species) for these was Q2=15.5,
> quartiles score Q1=11 and Q3=27, so I'd say SpamAssassin did
> a good job with these. The most valuable score contributions
> seem to have come from the mail header section (subject, RBL, bayes);
> attachment contents were probably less important.
SA's default is 512kB, right? Many people raise that to close to 1MB.
After that, how much of your checked corpus would have survived RBL
rejects at MTA level?
Such a sample doesn't convince me (yet) as it doesn't show potential FPs
due to scans on raw encoded attachments after 4 lines of txt/html as well
as timing per body rule type.
Could you let me have this sample corpus to compare results with
spamc/spamd under different conditions?
Axb
Re: [Bug 6854] Optimizations, profiling
Posted by Mark Martinec <Ma...@ijs.si>.
On Tuesday October 23 2012 22:26:00 Axb wrote:
> Spamc/Spamd's "skip size" method has made a huge *positive* difference
> on FPs, and scan times.
> The FNs wouldn't *ever* have been caught by a chunk method due to the
> kind of content included "above" threshold.
Out of curiosity, during the last 10 days our system detected
almost 200 large spam messages (manually confirmed spam) with
size above 400 kB (of which SpamAssassin saw only the first
420 kB, the rest was truncated).
Of these there were 55 distinct species:
17 in the 400..500 kB region
16 in the 500..700 kB region
9 in the 700..1000 kB region
10 in the 1000..2000 kB region
2 of 2.8 MB
1 of 3.6 MB
Median spam score (by species) for these was Q2=15.5,
quartiles score Q1=11 and Q3=27, so I'd say SpamAssassin did
a good job with these. The most valuable score contributions
seem to have come from the mail header section (subject, RBL, bayes);
attachment contents were probably less important.
Mark
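For readers who want to check the breakdown in the message above, here is a minimal Python sketch that tallies the size buckets. The bucket labels and counts are copied from the post; the function and variable names are illustrative only and not part of SpamAssassin or amavisd.

```python
# Size buckets and species counts exactly as listed in the message above.
buckets = [
    ("400..500 kB", 17),
    ("500..700 kB", 16),
    ("700..1000 kB", 9),
    ("1000..2000 kB", 10),
    ("2.8 MB", 2),
    ("3.6 MB", 1),
]


def share_table(rows):
    """Return (label, count, percent of total) for each bucket."""
    whole = sum(count for _, count in rows)
    return [(label, count, round(100.0 * count / whole, 1)) for label, count in rows]


total = sum(count for _, count in buckets)

for label, count, pct in share_table(buckets):
    print(f"{label:>14}  {count:3d}  {pct:5.1f} %")
print(f"{'total':>14}  {total:3d}")
```

The counts add up to the 55 distinct species mentioned, and 42 of them (the first three buckets) are below 1000 kB.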
Re: [Bug 6854] Optimizations, profiling
Posted by Axb <ax...@gmail.com>.
On 10/23/2012 10:15 PM, Kevin A. McGrail wrote:
> On 10/23/2012 4:10 PM, Axb wrote:
>> On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
>>> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
>>>> A message larger than a certain configured size is truncated
>>>> at the configured size and that is what SpamAssassin sees.
>>>> No other contents processing in this data path, just
>>>> blunt truncation of the raw mail message. Works quite well,
>>>> certainly much better than not scanning large messages at all.
>>> Makes sense to me. Something we should consider for SA to do by default
>>> with spamc/spamd?
>>
>> why? wasn't spamassassin designed to ignore attachments instead of
>> what MailScanner and Amavisd are doing using the API?
>>
>> Why should SA spend time scanning "binary" content it cannot decode or
>> extract anything useful to apply rules?
>>
>> I consider the "chunk" method the worst way to do it as it may skip
>> txt/html content which could show up after the configured chunk size
>> while spending lots of cycles scanning a two liner with an attached
>> 400kb PDF/Word/etc attachment.
>> or did I get it all wrong?
> It's very synergistic because I wrote a note about this 2 days ago.
>
> With SpamC/SpamD, if the email (encoded) is larger than X size, the
> email is not scanned at all.
make it switchable?
scan_chunk yes
scan_chunk_size 400kb
but as a global method change, no thanks.
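To make the proposal concrete, here is a rough Python sketch of how two such settings might be read. The option names (spelled here as scan_chunk and scan_chunk_size) are only the ones suggested in this message; they are hypothetical, not existing SpamAssassin options, and treating "kb" as 1024 bytes is an assumption.

```python
import re

# Assumed unit multipliers; "kb" as 1024 bytes is a guess, not a documented rule.
_UNITS = {"": 1, "b": 1, "kb": 1024, "mb": 1024 ** 2}


def parse_size(text):
    """Turn a string such as '400kb' into a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", text)
    if not match or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def parse_options(lines):
    """Parse the two hypothetical options; chunk scanning stays off by default."""
    opts = {"scan_chunk": False, "scan_chunk_size": None}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "scan_chunk":
            opts[key] = value.lower() in ("yes", "1", "true")
        elif key == "scan_chunk_size":
            opts[key] = parse_size(value)
        else:
            raise ValueError(f"unknown option: {key!r}")
    return opts


print(parse_options(["scan_chunk yes", "scan_chunk_size 400kb"]))
```

Keeping the default at "off" matches the request above that chunk scanning be switchable rather than a global change.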
> My thoughts were to ignore any binary attachments. I even considered
> writing a glue that would rewrite a temporary copy of the email to
> remove binary attachments and see if THAT met the threshold. But if
> real-world experience with simply chopping works well, who am I to
> complain?
I stopped using MailScanner exactly due to that reason: lots of plugins
and rules go nuts over encoded attachments.
Spamc/Spamd's "skip size" method has made a huge *positive* difference
on FPs, and scan times.
The FNs wouldn't *ever* have been caught by a chunk method due to the
kind of content included "above" threshold.
Re: [Bug 6854] Optimizations, profiling
Posted by Axb <ax...@gmail.com>.
On 10/23/2012 11:29 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Axb wrote:
>
>> On 10/23/2012 10:48 PM, John Hardin wrote:
>>> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>>>
>>> > My thoughts were to ignore any binary attachments.
>>>
>>> I don't think that's justified. I'm beginning to see a resurgence of
>>> image spams that the OCR plugin would probably catch. Plus I fairly
>>> regularly see 419 spams with the body of the pitch in a PDF or MS Word
>>> document attachment.
>>
>> SA never scanned binary attachments and the chunk method wouldn't
>> change that, just apply rules to content for which it was not
>> designed.
>>
>> PDF/Word attachments need to be detected by checksum or other newer
>> methods, but definitely not by the existing rule methods.
>> You won't get anything useful with a raw/body rule or any other regex
>> scanner out of an encoded chunk of an attachment.
>
> I'm not suggesting you would.
>
>> Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to
>> do foo against attachments.
>
> That's my point. If we strip binary attachments, what would PDFinfo,
> Imageinfo, FuzzyOCR et al. have to work with?
>
> Or am I misunderstanding and this stripping is occurring internally to
> SA and affects what the RE rules scan? If so, I apologize, I was
> assuming the context was spamc or something else client-side doing the
> strip/ignore and SA never getting the attachments in the first place...
iirc, SA gets the attachments; it just doesn't run rules against them, yet
permits plugins to handle the attachments.
This allows stuff like attachment hashers, OCR scanners, etc. to handle
the attachments.
The raw chunk method can also break this if SA only sees part of the
attachment due to a configured chunk limit. (been there)
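A small Python sketch of why a cut-off attachment can defeat a checksum-style plugin. It is an illustration only: the 1 MiB stand-in payload, the 400 kB limit and the helper name are assumptions, and no actual SpamAssassin plugin is being reproduced here.

```python
import base64
import hashlib

# Stand-in for a binary attachment (1 MiB of repeating bytes).
attachment = bytes(range(256)) * 4096
encoded = base64.encodebytes(attachment)  # Base64 text as it would sit in the mail

limit = 400 * 1024          # assumed chunk limit, in bytes
truncated = encoded[:limit]  # what a scanner sees after blunt truncation


def decode_prefix(b64):
    """Decode as much of a possibly cut-off Base64 body as possible."""
    compact = b"".join(b64.split())              # remove line breaks
    usable = len(compact) - (len(compact) % 4)   # keep whole 4-character groups
    return base64.b64decode(compact[:usable])


partial = decode_prefix(truncated)
full_hash = hashlib.sha256(attachment).hexdigest()
partial_hash = hashlib.sha256(partial).hexdigest()

print(len(attachment), len(partial))
print(full_hash == partial_hash)
```

The decoded prefix is a strict prefix of the real attachment, so any checksum computed over it differs from the checksum of the complete file.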
Re: [Bug 6854] Optimizations, profiling
Posted by John Hardin <jh...@impsec.org>.
On Tue, 23 Oct 2012, Axb wrote:
> On 10/23/2012 10:48 PM, John Hardin wrote:
>> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>>
>> > My thoughts were to ignore any binary attachments.
>>
>> I don't think that's justified. I'm beginning to see a resurgence of
>> image spams that the OCR plugin would probably catch. Plus I fairly
>> regularly see 419 spams with the body of the pitch in a PDF or MS Word
>> document attachment.
>
> SA never scanned binary attachments and the chunk method wouldn't change
> that, just apply rules to content for which it was not designed.
>
> PDF/Word attachments need to be detected by checksum or other newer methods,
> but definitely not by the existing rule methods.
> You won't get anything useful with a raw/body rule or any other regex scanner
> out of an encoded chunk of an attachment.
I'm not suggesting you would.
> Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to do foo
> against attachments.
That's my point. If we strip binary attachments, what would PDFinfo,
Imageinfo, FuzzyOCR et al. have to work with?
Or am I misunderstanding and this stripping is occurring internally to SA
and affects what the RE rules scan? If so, I apologize, I was assuming the
context was spamc or something else client-side doing the strip/ignore and
SA never getting the attachments in the first place...
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One death is a tragedy; thirty is a media sensation;
a million is a statistic. -- Joseph Stalin, modernized
-----------------------------------------------------------------------
145 days since the first successful private support mission to ISS (SpaceX)
Re: [Bug 6854] Optimizations, profiling
Posted by Henrik Krohns <he...@hege.li>.
On Tue, Oct 23, 2012 at 11:02:37PM +0200, Axb wrote:
> On 10/23/2012 10:48 PM, John Hardin wrote:
> >On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
> >
> >>My thoughts were to ignore any binary attachments.
> >
> >I don't think that's justified. I'm beginning to see a resurgence of
> >image spams that the OCR plugin would probably catch. Plus I fairly
> >regularly see 419 spams with the body of the pitch in a PDF or MS Word
> >document attachment.
>
> SA never scanned binary attachments and the chunk method wouldn't
> change that, just apply rules to content for which it was not
> designed.
Just as a reminder for everyone:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582
The problem here is SA stumbling onto large masses of data that it believes
to be "text", thus running all body rules etc on that. If we fix or limit
the impact of that, there's no reason to have any kind of silly "skip or
chop large message" kludges.
You'd only want to skip large messages completely if you are very low on
resources and can't spare CPU or a few megs of memory to keep the parsed
message blobs in memory. Ok, there probably isn't much spam in the 10MB
range anymore, so you might skip that..
Re: [Bug 6854] Optimizations, profiling
Posted by Axb <ax...@gmail.com>.
On 10/23/2012 10:48 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>
>> My thoughts were to ignore any binary attachments.
>
> I don't think that's justified. I'm beginning to see a resurgence of
> image spams that the OCR plugin would probably catch. Plus I fairly
> regularly see 419 spams with the body of the pitch in a PDF or MS Word
> document attachment.
SA never scanned binary attachments and the chunk method wouldn't
change that, just apply rules to content for which it was not designed.
PDF/Word attachments need to be detected by checksum or other newer
methods, but definitely not by the existing rule methods.
You won't get anything useful with a raw/body rule or any other regex
scanner out of an encoded chunk of an attachment.
Stuff like PDFinfo, Imageinfo, etc are the kind of plugins required to do
foo against attachments.
Re: [Bug 6854] Optimizations, profiling
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 10/23/2012 4:48 PM, John Hardin wrote:
> On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
>
>> My thoughts were to ignore any binary attachments.
>
> I don't think that's justified. I'm beginning to see a resurgence of
> image spams that the OCR plugin would probably catch. Plus I fairly
> regularly see 419 spams with the body of the pitch in a PDF or MS Word
> document attachment.
Sorry, that's not QUITE what I meant. If an email is over the limit,
ignore the binary attachments.
Re: [Bug 6854] Optimizations, profiling
Posted by John Hardin <jh...@impsec.org>.
On Tue, 23 Oct 2012, Kevin A. McGrail wrote:
> My thoughts were to ignore any binary attachments.
I don't think that's justified. I'm beginning to see a resurgence of image
spams that the OCR plugin would probably catch. Plus I fairly regularly
see 419 spams with the body of the pitch in a PDF or MS Word document
attachment.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
My sidearm is a piece of emergency equipment. It absolutely must
be reliable, not "smart".
-----------------------------------------------------------------------
145 days since the first successful private support mission to ISS (SpaceX)
Re: [Bug 6854] Optimizations, profiling
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 10/23/2012 4:10 PM, Axb wrote:
> On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
>> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
>>> A message larger than a certain configured size is truncated
>>> at the configured size and that is what SpamAssassin sees.
>>> No other contents processing in this data path, just
>>> blunt truncation of the raw mail message. Works quite well,
>>> certainly much better than not scanning large messages at all.
>> Makes sense to me. Something we should consider for SA to do by default
>> with spamc/spamd?
>
> why? wasn't spamassassin designed to ignore attachments instead of
> what MailScanner and Amavisd are doing using the API?
>
> Why should SA spend time scanning "binary" content it cannot decode or
> extract anything useful to apply rules?
>
> I consider the "chunk" method the worst way to do it as it may skip
> txt/html content which could show up after the configured chunk size
> while spending lots of cycles scanning a two liner with an attached
> 400kb PDF/Word/etc attachment.
> or did I get it all wrong?
It's very synergistic because I wrote a note about this 2 days ago.
With SpamC/SpamD, if the email (encoded) is larger than X size, the
email is not scanned at all.
My thoughts were to ignore any binary attachments. I even considered
writing a glue that would rewrite a temporary copy of the email to
remove binary attachments and see if THAT met the threshold. But if
real-world experience with simply chopping works well, who am I to complain?
regards,
KAM
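One way to picture the idea in the message above, as a Python sketch using the standard email package. It is a variation, not the glue described: instead of rewriting a temporary copy, it only measures how large the text parts are and uses that for the threshold decision. The function names, the 100 kB limit and the sample message are all assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def text_only_size(raw):
    """Sum the decoded sizes of the text/* leaf parts of a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    total = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_maintype() == "text":
            total += len(part.get_payload(decode=True) or b"")
    return total


def should_scan(raw, limit):
    """Scan if the whole message fits, or if it fits once binary parts are ignored."""
    return len(raw) <= limit or text_only_size(raw) <= limit


# Build a sample: a short text body plus a large binary attachment.
sample = EmailMessage()
sample["Subject"] = "test"
sample.set_content("Two lines of text.\n")
sample.add_attachment(b"\x00" * 400_000, maintype="application",
                      subtype="pdf", filename="big.pdf")
raw = sample.as_bytes()

LIMIT = 100 * 1024  # assumed threshold in bytes
print(len(raw) > LIMIT, should_scan(raw, LIMIT))
```

Because only the size check ignores the attachments, the full message could still be handed on afterwards, which would leave attachment-oriented plugins something to work with.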
Re: [Bug 6854] Optimizations, profiling
Posted by Axb <ax...@gmail.com>.
On 10/23/2012 09:59 PM, Kevin A. McGrail wrote:
> On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
>> A message larger than a certain configured size is truncated
>> at the configured size and that is what SpamAssassin sees.
>> No other contents processing in this data path, just
>> blunt truncation of the raw mail message. Works quite well,
>> certainly much better than not scanning large messages at all.
> Makes sense to me. Something we should consider for SA to do by default
> with spamc/spamd?
why? wasn't spamassassin designed to ignore attachments instead of what
MailScanner and Amavisd are doing using the API?
Why should SA spend time scanning "binary" content it cannot decode or
extract anything useful to apply rules?
I consider the "chunk" method the worst way to do it as it may skip
txt/html content which could show up after the configured chunk size
while spending lots of cycles scanning a two liner with an attached
400kb PDF/Word/etc attachment.
or did I get it all wrong?
Axb
Re: [Bug 6854] Optimizations, profiling
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 10/23/2012 3:48 PM, bugzilla-daemon@issues.apache.org wrote:
> A message larger than a certain configured size is truncated
> at the configured size and that is what SpamAssassin sees.
> No other contents processing in this data path, just
> blunt truncation of the raw mail message. Works quite well,
> certainly much better than not scanning large messages at all.
Makes sense to me. Something we should consider for SA to do by default
with spamc/spamd?
regards,
KAM
[Bug 6854] Optimizations, profiling
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854
--- Comment #3 from Mark Martinec <Ma...@ijs.si> ---
> I'm not sure I understand:
> Does Amavisd send chunks of raw message to SA instead of only the txt/html
> parts and leave "attachments" unscanned?
A message larger than a certain configured size is truncated
at the configured size and that is what SpamAssassin sees.
No other contents processing in this data path, just
blunt truncation of the raw mail message. Works quite well,
certainly much better than not scanning large messages at all.
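As a rough illustration of the two policies being contrasted in this thread, a minimal Python sketch follows. It is not amavisd or spamc code; the function names are made up, and reading "420 kB" as 420 * 1024 bytes is an assumption.

```python
def truncate_raw_message(raw, limit):
    """Blunt truncation: pass on at most `limit` bytes of the raw message, unparsed."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return raw[:limit]


def skip_if_too_large(raw, limit):
    """Skip-size policy: return None (do not scan at all) when over the limit."""
    return raw if len(raw) <= limit else None


LIMIT = 420 * 1024  # assumed interpretation of "420 kB"

# A stand-in for a 3 MB message: a tiny header plus a large body.
message = b"Subject: test\r\n\r\n" + b"A" * (3 * 1024 * 1024)

seen = truncate_raw_message(message, LIMIT)
print(len(seen), skip_if_too_large(message, LIMIT) is None)
```

With truncation the header section and the start of the body still reach the scanner; with the skip-size policy nothing does.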
[Bug 6854] Optimizations, profiling
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854
--- Comment #2 from AXB <ax...@gmail.com> ---
(In reply to comment #0)
> Created attachment 5102 [details]
> The low-hanging fruit
>
> Spent a day with a NYTProf 4.08 Perl profiler trying to cut down
> some of the inefficiencies of SpamAssassin dealing with large mail
> messages (which are usually large thanks to some Base64-encoded
> attachments). Using Perl 5.16 on a FreeBSD 9.1 platform.
>
> Picking just the low-hanging fruit with most outstanding hotspots in
> each iteration, I managed to shave off about 100 ms of CPU-intensive
> hotspots (local tests only) in a command-line spamassassin run
> (with a 3 MB message containing a large PDF).
>
> Depending on what is being measured (like aggregate mail throughput),
> and what proportion of large messages are being passed to SpamAssassin
> (like passing only the first 420 kB from amavisd to SpamAssassin),
> this amounts to between 3 and 7 % of a speedup for large messages.
> Not too bad where every bit adds up.
I'm not sure I understand:
Does Amavisd send chunks of raw message to SA instead of only the txt/html
parts and leave "attachments" unscanned?
[Bug 6854] Optimizations, profiling
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6854
--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
trunk:
$ svn ci -m 'Bug 6854: Optimizations, profiling'
Sending lib/Mail/SpamAssassin/Conf/Parser.pm
Sending lib/Mail/SpamAssassin/Message.pm
Sending lib/Mail/SpamAssassin/Plugin/MIMEEval.pm
Sending lib/Mail/SpamAssassin/Plugin/VBounce.pm
Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1401393.