Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/08 17:52:24 UTC

[Bug 4469] New: Add a process/option to efficiently deal with very long mail messages

http://bugzilla.spamassassin.org/show_bug.cgi?id=4469

           Summary: Add a process/option to efficiently deal with very long
                    mail messages
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P4
         Component: spamc/spamd
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: lwilton@earthlink.net


Occasional reports are starting to appear of very large spams that get
past SA because they exceed the length cutoff limit.

Passing the entire message to SA would of course not be a Good Thing to do.
However, armchair reasoning suggests that the spaminess of a message can
probably be determined reasonably accurately from the headers and the first
2..10K or so of the message body in virtually all cases, even for messages
in the 20K..250K range.

I suggest two things here: first, an option to SA (perhaps a special line at
the front of the message stream itself) that tells it that this will be a
partial message; and second, a change to spamd to pass partial messages,
along with this flag, when some size limit is exceeded.

Since only a partial message is being passed, obviously spamd can't just pipe 
the entire message thru SA and out the other end.  Instead, it will have to get 
a declaration from SA of spaminess, and then do something itself with the 
original message.
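
The truncation step proposed above could look roughly like the following
Python sketch. It is illustrative only and is not spamd code: the function
names and the 10K default are assumptions, and the thread does not settle how
the "partial" flag would actually be sent to SA, so it is returned here as a
plain boolean.

```python
from typing import Tuple


def split_headers(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw message at the first blank line (CRLF or bare LF)."""
    ends = [raw.find(sep) + len(sep)
            for sep in (b"\r\n\r\n", b"\n\n") if sep in raw]
    cut = min(ends) if ends else len(raw)
    return raw[:cut], raw[cut:]


def truncate_for_scan(raw: bytes, body_limit: int = 10 * 1024) -> Tuple[bytes, bool]:
    """Return (data_to_scan, is_partial).

    body_limit is a hypothetical setting; the report suggests 2..10K.
    """
    headers, body = split_headers(raw)
    if len(body) <= body_limit:
        return raw, False          # small message: scan it whole
    return headers + body[:body_limit], True


msg = b"Subject: test\r\n\r\n" + b"x" * 50000
data, partial = truncate_for_scan(msg)
print(len(data), partial)
```

The caller keeps the original message and uses only the verdict (and any new
headers) that comes back from scanning the truncated copy.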

The purpose of the flag to SA for a partial message would be twofold: it would 
disable some of the rules that expect correct mime-part terminations, and it 
might change the output from SA to perhaps only be headers for the message, 
plus a return value that somehow indicates spam.  This return value might be in 
the form of a real return value, or a first header line with special 
formatting, or perhaps something else.

If SA operating in this mode returned modified headers only, it would be 
trivial for the spamd child to remove the original message headers and replace 
them with the SA-supplied headers, and pipe the rest of the message straight 
through, thus avoiding the SA large-message overhead.
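
As a rough illustration of that header swap, here is a Python sketch (again
not spamd code). It assumes the SA-supplied header block arrives as bytes
that end with a line terminator but no blank line; the function name and the
chunk size are made up for the example.

```python
import io
from typing import BinaryIO


def splice_headers(original: BinaryIO, new_headers: bytes, out: BinaryIO,
                   chunk_size: int = 64 * 1024) -> None:
    """Replace the header block of `original` and stream the body unchanged."""
    # Skip the original headers: consume lines up to the blank separator.
    for line in iter(original.readline, b""):
        if line in (b"\r\n", b"\n"):
            break
    out.write(new_headers)
    out.write(b"\r\n")             # blank line between headers and body
    # Copy the body in fixed-size chunks so it is never held in memory whole.
    while chunk := original.read(chunk_size):
        out.write(chunk)


src = io.BytesIO(b"Subject: hi\r\nTo: a@example.com\r\n\r\nbody text\r\n")
dst = io.BytesIO()
splice_headers(src, b"Subject: hi\r\nX-Spam-Flag: YES\r\n", dst)
print(dst.getvalue().decode())
```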

However this sort of option is implemented (if it is), it should be done in a 
way that tools calling SA or the SA API directly can fairly easily implement 
spam detection using this option.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.






------- Additional Comments From felicity@apache.org  2005-07-08 10:26 -------
Subject: Re:  Add a process/option to efficiently deal with very long mail messages

On Fri, Jul 08, 2005 at 10:10:15AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
> pristine message -- if the size goes over the scanning-size threshold, the
> remainder of the message data is written to a temp file instead of stored in
> memory.  (we already use temp files anyway in parts of the code.)

Yeah, I was thinking of something similar where text/* parts (at least)
are kept in memory, but other parts are stored in temp files since they'll
only be rarely used if at all.  Heck, even keep the filename in the part
information so that if a plugin wants to call an AV scanner, or something,
on that part it'd be easy to just point at the file instead of creating
a whole new temp file from the other temp file. ;)
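
A minimal Python sketch of that storage policy follows. It is not the SA
message-part code: `StoredPart` and `store_part` are hypothetical names, and
the caller is assumed to be responsible for deleting the temp file later.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredPart:
    content_type: str
    data: Optional[bytes] = None   # set for parts kept in memory
    path: Optional[str] = None     # set for parts written to a temp file


def store_part(content_type: str, payload: bytes) -> StoredPart:
    """Keep text/* parts in memory; write everything else to a temp file."""
    if content_type.lower().startswith("text/"):
        return StoredPart(content_type, data=payload)
    fd, path = tempfile.mkstemp(prefix="part-")
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    # The filename stays with the part, so a plugin (an AV scanner, say)
    # can be pointed at the file without making another copy.
    return StoredPart(content_type, path=path)


text = store_part("text/plain", b"hello")
blob = store_part("application/octet-stream", b"\x00\x01\x02")
print(text.path, blob.data, os.path.exists(blob.path))
os.remove(blob.path)
```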

In the original SA3 code, BTW, everything was a temp file.  That seemed
overly complicated, since each part can have multiple versions, etc., so
it was converted to the "all in memory" version.

> This would allow us to scan even 100MB mails without breaking a sweat and
> causing all those FAQs on the users list. ;)

Well, yes and no.  There's still the hit of storing the message in memory,
at least once, when it's initially read in.  We could store the pristine
body in a temp file, but then any full rules or the rewrite at the end
will cause that to come back in.

SA is really tuned for "everything in memory".










------- Additional Comments From lwilton@earthlink.net  2005-07-08 10:48 -------
Subject: Re:  Add a process/option to efficiently deal with very long mail messages

I have some (perhaps incorrect) memory that Bayes learning is limited to
some KB of the message since there was no real use to going further.
Perhaps the same limit would be reasonable for normal scanning?










------- Additional Comments From jm@jmason.org  2005-07-08 10:10 -------
yes, I agree something like this would be a worthwhile approach.  fwiw, I'd
prefer to do this entirely inside the Mail::SA modules, however.

IMO, we should take the qpsmtpd approach, too, in terms of storage of the full
pristine message -- if the size goes over the scanning-size threshold, the
remainder of the message data is written to a temp file instead of stored in
memory.  (we already use temp files anyway in parts of the code.)

This would allow us to scan even 100MB mails without breaking a sweat and
causing all those FAQs on the users list. ;)








------- Additional Comments From lwilton@earthlink.net  2005-07-08 10:51 -------
Subject: Re:  Add a process/option to efficiently deal with very long mail messages

> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
>
> Well, yes and no.  There's still the hit of storing the message in memory,
> at least once, when it's initially read in.  We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.
>
> SA is really tuned for "everything in memory".

Which is why I suggested doing this in spamd and just passing the
'reasonable size' to SA itself.  It eliminates all those niggling worries
about some line of code somewhere suddenly sucking 100MB of text into a
hash or the like.







Bob@Menschel.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |Future




------- Additional Comments From Bob@Menschel.net  2005-07-08 22:36 -------
Ref bug 2977








------- Additional Comments From automasschecker@jmason.org  2005-07-08 10:40 -------
Subject: Re:  Add a process/option to efficiently deal with very long mail messages 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> In the original SA3 code, BTW, everything was a temp file.  That seemed
> overly complicated, since each part can have multiple versions, etc., so
> it was converted to the "all in memory" version.

it's also slower. the qpsmtpd algorithm is nice, both for speed and RAM:
it goes like this:

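  use File::Temp qw(tempfile);

  my $some_limit = 250 * 1024;  # scanning-size threshold (example value)
  my $buffer = '';              # the part we're prepared to scan
  my $size = 0;                 # bytes read so far
  my $tmpfile_handle;           # closed and unset until the limit is passed
  while (defined(my $line = <STDIN>)) {   # input source assumed to be STDIN
    $size += length $line;
    if ($size > $some_limit) {
      # open the tmpfile lazily, the first time we overflow
      ($tmpfile_handle) = tempfile(UNLINK => 1) unless $tmpfile_handle;
      print {$tmpfile_handle} $line or die "tmpfile write failed: $!";
    }
    else {
      $buffer .= $line;         # still under the limit: keep in memory
    }
  }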

so the benefit is that the buffer contains the text part we're prepared to
scan, and the tmpfile is only ever opened (and disk I/O incurred) for
massive mails.

> > This would allow us to scan even 100MB mails without breaking a sweat and
> > causing all those FAQs on the users list. ;)
> 
> Well, yes and no.  There's still the hit of storing the message in memory,
> at least once, when it's initially read in.  We could store the pristine
> body in a temp file, but then any full rules or the rewrite at the end
> will cause that to come back in.

full rules: change the semantics to only match the first 250k of the
message data

rewrite: add a new iterator interface as well as the old all-in-RAM
interface
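
One way to picture those two changes together is the Python sketch below. It
is an illustration, not SA's interface: the class and method names are
invented, and the 250k default simply echoes the figure above. "Full" rules
would see only `scan_window()`, while the rewrite step would walk
`iter_chunks()` instead of loading the whole message.

```python
import os
import tempfile
from typing import Iterator


class SpilledMessage:
    """First `limit` bytes in memory, remainder in a lazily opened temp file."""

    def __init__(self, limit: int = 250 * 1024) -> None:
        self.limit = limit
        self.head = bytearray()
        self.tail = None           # temp file, created on first overflow

    def write(self, data: bytes) -> None:
        room = self.limit - len(self.head)
        if room > 0:
            self.head += data[:room]
            data = data[room:]
        if data:
            if self.tail is None:
                self.tail = tempfile.TemporaryFile()
            self.tail.seek(0, os.SEEK_END)   # always append
            self.tail.write(data)

    def scan_window(self) -> bytes:
        """What a 'full' rule would be allowed to match against."""
        return bytes(self.head)

    def iter_chunks(self, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        """Yield the whole message piece by piece, for the final rewrite."""
        if self.head:
            yield bytes(self.head)
        if self.tail is not None:
            self.tail.seek(0)
            while chunk := self.tail.read(chunk_size):
                yield chunk


m = SpilledMessage(limit=8)
m.write(b"0123456789")
m.write(b"abc")
print(m.scan_window(), b"".join(m.iter_chunks()))
```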

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCzro2MJF5cimLx9ARAhv+AJ9KvZcVbkPlBKOGmo7wIRrFIzgWsACgmCXT
mEDzMudMpTcoZwDKkkrzjJc=
=Mf8Z
-----END PGP SIGNATURE-----










------- Additional Comments From jgmyers@proofpoint.com  2005-07-08 10:36 -------
I have a plugin that processes non-text parts in perl.  I would appreciate
continuing to be able to do so.



