You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/02/14 02:21:05 UTC

renaming PerMsgStatus class

OK, I had an idea.  Conceptually we have these main objects during
scanning:

  1. the message object passed in from the user: MsgParser?

  2. the "main" factory object used to create other objects and
     store global state: Mail::SpamAssassin

  3. a scanner object: Mail::SpamAssassin::PerMsgStatus, so called
     because it holds "per-message status information"

I propose we rename #3 to: Mail::SpamAssassin::Scan.   Why? because
a scanning operation is conceptually:

  - user creates Mail::SpamAssassin factory

  - user creates Mail::SpamAssassin::MsgParser object with the mail text
    he wants to check

  - user passes Mail::SpamAssassin::MsgParser object to
    Mail::SpamAssassin::check() method

  - that creates, runs and returns a Mail::SpamAssassin::Scan object
    containing data about that scan operation

  - (repeat steps 2-4)

In other words, the Mail::SpamAssassin::Scan object is data about a "scan
operation".   Hence "Scan".  make sense?  (it's certainly shorter than
"PerMsgStatus".)

--j.

Re: renaming PerMsgStatus class

Posted by Theo Van Dinter <fe...@kluge.net>.

On Wed, Feb 18, 2004 at 01:07:30AM -0500, Theo Van Dinter wrote:
> I've been going back and forth with whether or not this is a good idea.
> So far, it looks like the worst case (doing checks) nets essentially the
> same timing, and the best case (removing markup, etc) gets a time benefit.
> This puts me in the "lets do it" category at the moment.

checked in ... r6716.  I added some docs and such too, in a good doobie
fashion. ;)

-- 
Randomly Generated Tagline:
Well, hey, let's just make everything into a closure, and then we'll
 have our general garbage collector, installed by "use less memory".
              -- Larry Wall in <19...@wall.org>

Re: renaming PerMsgStatus class

Posted by Theo Van Dinter <fe...@kluge.net>.

On Tue, Feb 17, 2004 at 10:15:51PM -0500, Theo Van Dinter wrote:
> I dunno, any thoughts?

Just for kicks, I coded this up.  It turned out not to be too too bad --
only had to add code to 3 routines in MsgContainer to look to see if
the message was parsed yet.  Here's some "make test" output, just to
see some rough time estimates:

Went from ... (r6714)

Files=48, Tests=414, 258 wallclock secs (55.41 cusr +  5.28 csys = 60.69 CPU)
Files=48, Tests=414, 225 wallclock secs (54.78 cusr +  5.25 csys = 60.03 CPU)
Files=48, Tests=414, 229 wallclock secs (55.38 cusr +  5.08 csys = 60.46 CPU)

to ...

Files=48, Tests=414, 220 wallclock secs (55.08 cusr +  4.89 csys = 59.97 CPU)
Files=48, Tests=414, 216 wallclock secs (54.71 cusr +  5.12 csys = 59.83 CPU)
Files=48, Tests=414, 221 wallclock secs (55.39 cusr +  4.70 csys = 60.09 CPU)


So ...  Everything still passes, and we don't really end up with any
change in time.  There were 3 functions that needed to check if the parse
was done already, but I was able to get that down to just 1 function.
I was then able to optimize out (relatively) the penalty involved in the
"do I need to parse" check.

I don't really like "make test" as a speed indicator, but it looks like
the check() part stays the same speed, but code such as "spamassassin -d"
(anything that doesn't end up calling find_parts() actually) will speed up.

I've been going back and forth with whether or not this is a good idea.
So far, it looks like the worst case (doing checks) nets essentially the
same timing, and the best case (removing markup, etc) gets a time benefit.
This puts me in the "lets do it" category at the moment.

-- 
Randomly Generated Tagline:
Wait a minute, Marge.  I saw "Mrs. Doubtfire."  This is a man in drag!
 
 		-- Homer Simpson
 		   Simpsoncalifragilisticexpiala(annoyed grunt)ocious

Re: renaming PerMsgStatus class

Posted by Theo Van Dinter <fe...@kluge.net>.

On Tue, Feb 17, 2004 at 02:37:44PM -0800, Justin Mason wrote:
> Yeah, +1 to this IMO.

Change went in -- r6713.  It's now M::SA->parse(), taking the same
parameters as the M::SA::MsgParser version.

Another thing JM and I were talking about is to not have parse()
automatically go through the MIME parts.  Basically the first call
generates a basic MsgContainer object, then calls which need the parsed
version cause the MIME parts to get parsed.

Part of me thinks this is a good idea, and part of me doesn't.  The part
of me that does thinks that this would save CPU time at creation since
if the code isn't going to use the parsed bits ('spamassassin -d'
for example), then there's no point in going through the whole MIME
parsing code.

The other part of me thinks that it'll end up causing a lot of calls to
"do_I_need_to_parse()" when the majority of the time SA will be running
in check mode, which will need the parsed version, thereby eliminating
the speed increase we think we're getting.  The whole "optimize for the
common case" thing.

In the greater scheme of things, the parsing doesn't take very long.
It was the decoding/rendering that took the CPU time, but that happens
"on-the-fly" now when the decoded/rendered part is requested, so we're
pretty much just looping through an array now and generating an object
tree.  The time to do that is pretty small in relation to everything
else that's going on.

I was also thinking of making an option to parse that says whether to
do the MIME part parse now or later.  By default it'll happen "now",
but if set to later, the caller could call parse_now() or something
which would set things up ...

I dunno, any thoughts?

-- 
Randomly Generated Tagline:
"Oh, well; I'll work next week."         - Peter Sagerson

Re: renaming PerMsgStatus class

Posted by Theo Van Dinter <fe...@kluge.net>.

On Sat, Feb 14, 2004 at 01:35:50AM -0800, Dan Quinlan wrote:
> >   1. the message object passed in from the user: MsgParser?
> 
> Question: should it just be named "Message" ?

Well, I'm changing the API again anyway...  (I talked this over with JM anyway...)

The way it is now, you call MsgParser::parse() with the message, and it
returns a MsgContainer object which is the root of the parse tree.

The new way is basically the same, except you call M::SA::parse() with
the message, and it'll return a <insert name here -- Message?> object.
The goal being that we can change the backend in the future without
needing to change the user-facing API again.

-- 
Randomly Generated Tagline:
As usual, this being a 1.3.x release, I haven't even compiled this
 kernel yet.  So if it works, you should be doubly impressed.
 (Linus Torvalds, announcing kernel 1.3.3 on the linux-kernel mailing list.)

RE: renaming PerMsgStatus class

Posted by Gary Funck <ga...@intrepid.com>.


I'm not following this thread closely, and don't know the SA API well,
but have used a couple of the mail-parsing-related modules on CPAN.
I don't know if it is a goal of SA developers to offer a "better,
more robust" mail parser, but it'd be nice if there were one. The
mail parsers I've tried do not seem to deal well with certain kinds
of non-compliant mail messages -- which of course occur frequently
in spam mail.

Long story short, is it out of the question to break out SA's mail parsing
into a module that has a more generic, functional module name that would
fit into the schema of some of the other mail parsers out there? I suppose
that was the case in current releases where the SA mail parser was a
sub-class
of Mail::Audit?

(I'm speaking from a very uninformed point of view, but hopefully the point
is
clear: to view the message parser as useful stand alone module in its own
right,
and possibly reenforce this by giving it a name and structure that fits in
with
existing mail parsing modules.)

Re: renaming PerMsgStatus class

Posted by Daniel Quinlan <qu...@pathname.com>.

jm@jmason.org (Justin Mason) writes:

> OK, I had an idea.  Conceptually we have these main objects during
> scanning:
>
>   1. the message object passed in from the user: MsgParser?

Question: should it just be named "Message" ?

>   2. the "main" factory object used to create other objects and
>      store global state: Mail::SpamAssassin

Okay.

>   3. a scanner object: Mail::SpamAssassin::PerMsgStatus, so called
>      because it holds "per-message status information"

I'm with you...

> I propose we rename #3 to: Mail::SpamAssassin::Scan.   Why? because
> a scanning operation is conceptually:
>
>   - user creates Mail::SpamAssassin factory
>
>   - user creates Mail::SpamAssassin::MsgParser object with the mail text
>     he wants to check

Mail::SpamAssassin::Message

>   - user passes Mail::SpamAssassin::MsgParser object to
>     Mail::SpamAssassin::check() method
>
>   - that creates, runs and returns a Mail::SpamAssassin::Scan object
>     containing data about that scan operation

Okay.

>   - (repeat steps 2-4)
>
> In other words, the Mail::SpamAssassin::Scan object is data about a
> "scan operation".  Hence "Scan".  make sense?  (it's certainly shorter
> than "PerMsgStatus".)

"Scan" sounds good to me.

Just so it's been asked, is the current top-level object design and this
approach the right way?

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

MsgParser API change (was: Re: renaming PerMsgStatus class)

Posted by Theo Van Dinter <fe...@kluge.net>.

On Fri, Feb 13, 2004 at 05:21:05PM -0800, Justin Mason wrote:
>   1. the message object passed in from the user: MsgParser?

MsgContainer actually.  MsgParser just parses the message.  I had some
thoughts about this this morning...:

Let's make people call M::SA::parse(), which just calls MsgParser on
the back-end.  You still get a MsgContainer (or whatever) object out
so you can prep multiple messages to send through, but we don't have
to worry about the API changing again in the future since we'll control
what M::SA::parse() calls.

-- 
Randomly Generated Tagline:
if (quackslike(X)==DUCK) return DUCK;