Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/08 17:52:24 UTC
[Bug 4469] New: Add a process/option to efficiently deal with very long mail messages
http://bugzilla.spamassassin.org/show_bug.cgi?id=4469
Summary: Add a process/option to efficiently deal with very long
mail messages
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: Other
OS/Version: other
Status: NEW
Severity: enhancement
Priority: P4
Component: spamc/spamd
AssignedTo: dev@spamassassin.apache.org
ReportedBy: lwilton@earthlink.net
Occasional reports are starting to appear of very large spams that make it
past SA because they exceed the length cutoff limit.
Passing the entire message to SA would of course not be a Good Thing to do.
However, armchair reasoning suggests that the spaminess of the message can
probably be determined reasonably accurately from the headers and the first
2..10K or so of the message body in virtually all cases. In fact, this is
probably virtually always true, even with messages in the 20K..250K range.
Suggest two things here: first, an option to SA (perhaps a special line on the
front of the message stream itself) that tells it that this will be a partial
message; and second, a change to spamd to pass partial messages, along with
this flag, when some size limit is exceeded.
Since only a partial message is being passed, obviously spamd can't just pipe
the entire message thru SA and out the other end. Instead, it will have to get
a declaration from SA of spaminess, and then do something itself with the
original message.
The purpose of the flag to SA for a partial message would be twofold: it would
disable some of the rules that expect correct mime-part terminations, and it
might change the output from SA to perhaps only be headers for the message,
plus a return value that somehow indicates spam. This return value might be in
the form of a real return value, or a first header line with special
formatting, or perhaps something else.
If SA operating in this mode returned modified headers only, it would be
trivial for the spamd child to remove the original message headers and replace
them with the SA-supplied headers, and pipe the rest of the message straight
through, thus avoiding the SA large-message overhead.
However this sort of option is implemented (if it is), it should be done in a
way that tools calling SA or the SA API directly can fairly easily implement
spam detection using this option.
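
A rough sketch of the flow proposed above, in Perl. This is not spamd or SA
code: partial_for_scan, splice_headers, and the 10-byte limit are hypothetical
names and values chosen for illustration, and the X-Spam-Flag header stands in
for whatever headers the scanner would hand back.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw message at the first blank line and keep
# only the first $limit bytes of the body for scanning.
sub partial_for_scan {
    my ($raw, $limit) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;
    my $truncated = length($body) > $limit ? 1 : 0;
    return ($headers, substr($body, 0, $limit), $truncated);
}

# Hypothetical helper: put scanner-supplied headers in front of the
# untouched original body, so the full body never passes through the scanner.
sub splice_headers {
    my ($raw, $new_headers) = @_;
    my (undef, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;
    return $new_headers . "\n\n" . $body;
}

my $raw = "Subject: hi\nFrom: a\@example.com\n\n" . ("x" x 50);

# Only $headers and $sample would be handed to the scanner; $truncated is
# the "this is a partial message" flag.
my ($headers, $sample, $truncated) = partial_for_scan($raw, 10);

# Pretend the scanner returned the original headers plus one extra header.
my $out = splice_headers($raw, "X-Spam-Flag: YES\n$headers");
print $out, "\n";
```

The point of the split is that the caller keeps the original message and only
ever swaps the header block, which is what makes the "headers only" reply
cheap to apply.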
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
------- Additional Comments From felicity@apache.org 2005-07-08 10:26 -------
Subject: Re: Add a process/option to efficiently deal with very long mail messages
On Fri, Jul 08, 2005 at 10:10:15AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
> pristine message -- if the size goes over the scanning-size threshold, the
> remainder of the message data is written to a temp file instead of stored in
> memory. (we already use temp files anyway in parts of the code.)
Yeah, I was thinking of something similar where text/* parts (at least)
are kept in memory, but other parts are stored in temp files since they'll
only be rarely used if at all. Heck, even keep the filename in the part
information so that if a plugin wants to call an AV scanner, or something,
on that part it'd be easy to just point at the file instead of creating
a whole new temp file from the other temp file. ;)
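
As an illustration of the idea in the paragraph above, a minimal Perl sketch.
The store_part function and the shape of the part record are assumptions made
up for this example, not the Mail::SpamAssassin message API.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical part record: text/* stays in memory, anything else is spooled
# to a temp file whose name is kept in the record, so a plugin can pass the
# file straight to an external tool such as an AV scanner.
sub store_part {
    my ($type, $data) = @_;
    if ($type =~ m{^text/}i) {
        return { type => $type, data => $data };
    }
    my ($fh, $filename) = tempfile(UNLINK => 1);   # removed at program exit
    binmode $fh;
    print {$fh} $data;
    close $fh or die "close $filename: $!";
    return { type => $type, file => $filename };
}

my $text  = store_part('text/plain', "hello\n");
my $image = store_part('image/gif', "GIF89a...");

print "text part in memory: ", length($text->{data}), " bytes\n";
print "image part on disk: $image->{file}\n";
```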
In the original SA3 code, BTW, everything was a temp file. That seemed
overly complicated, since each part can have multiple versions, etc, so
it was converted to the "all in memory" version.
> This would allow us to scan even 100MB mails without breaking a sweat and
> causing all those FAQs on the users list. ;)
Well, yes and no. There's still the hit of storing the message in memory,
at least once, when it's initially read in. We could store the pristine
body in a temp file, but then any full rules or the rewrite at the end
will cause that to come back in.
SA is really tuned for "everything in memory".
------- Additional Comments From lwilton@earthlink.net 2005-07-08 10:48 -------
Subject: Re: Add a process/option to efficiently deal with very long mail messages
I have some (perhaps incorrect) memory that Bayes learning is limited to
some KB of the message since there was no real use to going further.
Perhaps the same limit would be reasonable for normal scanning?
------- Additional Comments From jm@jmason.org 2005-07-08 10:10 -------
yes, I agree something like this would be a worthwhile approach. fwiw, I'd
prefer to do this entirely inside the Mail::SA modules, however.
IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
pristine message -- if the size goes over the scanning-size threshold, the
remainder of the message data is written to a temp file instead of stored in
memory. (we already use temp files anyway in parts of the code.)
This would allow us to scan even 100MB mails without breaking a sweat and
causing all those FAQs on the users list. ;)
------- Additional Comments From lwilton@earthlink.net 2005-07-08 10:51 -------
Subject: Re: Add a process/option to efficiently deal with very long mail messages
> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
>
> Well, yes and no. There's still the hit of storing the message in memory,
> at least once, when it's initially read in. We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.
>
> SA is really tuned for "everything in memory".
Which is why I suggested doing this in spamd and just passing the
'reasonable size' to SA itself. It eliminates all those niggling worries
about some line of code somewhere suddenly sucking 100MB of text into a
hash or the like.
Bob@Menschel.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |Future
------- Additional Comments From Bob@Menschel.net 2005-07-08 22:36 -------
Ref bug 2977
------- Additional Comments From automasschecker@jmason.org 2005-07-08 10:40 -------
Subject: Re: Add a process/option to efficiently deal with very long mail messages
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> In the original SA3 code, BTW, everything was a temp file. That seemed
> overly complicated, since each part can have multiple versions, etc, so
> it was converted to the "all in memory" version.
the all-temp-file approach is also slower. the qpsmtpd algorithm is nice,
both for speed and RAM. it goes like this:
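    use File::Temp qw(tempfile);

    my $some_limit = 250 * 1024;    # placeholder threshold, pick your own
    my $buffer = '';
    my $tmpfile_handle;             # closed and unset
    my $tmpfile_open = 0;
    my $size = 0;

    while (my $line = <STDIN>) {    # reading
      $size += length $line;
      if ($size > $some_limit) {
        if (!$tmpfile_open) {
          $tmpfile_open = 1;
          # tempfile() generates the tmpfile name and opens it, first time only
          $tmpfile_handle = tempfile();
        }
        print {$tmpfile_handle} $line;    # write to $tmpfile_handle
      }
      else {
        $buffer .= $line;                 # add to buffer
      }
    }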
so the benefit is that the buffer contains the text part we're prepared to
scan, and the tmpfile is only ever opened (and disk I/O incurred) for
massive mails.
> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
>
> Well, yes and no. There's still the hit of storing the message in memory,
> at least once, when it's initially read in. We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.
full rules: change the semantics to only match the first 250k of the
message data
rewrite: add a new iterator interface as well as the old all-in-RAM
interface
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFCzro2MJF5cimLx9ARAhv+AJ9KvZcVbkPlBKOGmo7wIRrFIzgWsACgmCXT
mEDzMudMpTcoZwDKkkrzjJc=
=Mf8Z
-----END PGP SIGNATURE-----
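
The two suggestions in the message above (full rules limited to the first 250k,
and an iterator interface for the rewrite) could look roughly like this in
Perl. This is a sketch under assumptions: full_rule_text, message_iterator,
and the chunk size are hypothetical and do not correspond to any existing
Mail::SpamAssassin interface.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $FULL_RULE_LIMIT = 250 * 1024;   # the "first 250k" from the comment above

# Hypothetical: the text a "full" rule would be matched against.
sub full_rule_text {
    my ($buffer) = @_;
    return substr($buffer, 0, $FULL_RULE_LIMIT);
}

# Hypothetical iterator: returns a closure that yields the in-memory buffer
# first, then the temp file (if any) in fixed-size chunks, then undef.
sub message_iterator {
    my ($buffer, $tmpfile_handle, $chunk_size) = @_;
    my $sent_buffer = 0;
    # Rewind the overflow file; seek also flushes any pending writes.
    seek($tmpfile_handle, 0, 0) if $tmpfile_handle;
    return sub {
        if (!$sent_buffer) {
            $sent_buffer = 1;
            return $buffer;
        }
        return undef unless $tmpfile_handle;
        my $n = read($tmpfile_handle, my $chunk, $chunk_size);
        die "read failed: $!" unless defined $n;
        return $n ? $chunk : undef;
    };
}

# Usage: a small in-memory buffer plus overflow data that went to disk.
my $fh = tempfile();
print {$fh} "overflow data";

my $next = message_iterator("buffered ", $fh, 4);
my $rebuilt = '';
while (defined(my $piece = $next->())) {
    $rebuilt .= $piece;   # a real rewrite would stream each piece out instead
}
print "$rebuilt\n";
```

With an interface like this, the rewrite step never needs more than one chunk
of the overflow data in memory at a time, while the old all-in-RAM interface
can stay available for callers that want it.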
------- Additional Comments From jgmyers@proofpoint.com 2005-07-08 10:36 -------
I have a plugin that processes non-text parts in perl. I would appreciate
continuing to be able to do so.