Posted to users@spamassassin.apache.org by to...@tacocat.net on 2007/01/19 21:09:06 UTC

spampd

I was looking at using spampd as my content_filter in Postfix and found
a comment in the Debian package description saying that it is not
suitable for use with a per-user Bayesian filtering configuration.

I wanted to confirm whether this is still the case.

I eventually intend to set up a database backend for per-user Bayesian
wordlists, and I am unclear as to what would prevent spampd from being
suitable for this application.

I'm also curious whether anyone can give a rough idea of how long it
takes to process an email.  I know it depends on a million conditions,
but I'm asking whether it's closer to 1 second, 10 seconds, 100 seconds,
or 1000 seconds.  Any estimate like that would be fine.

Thanks!

Re: spampd

Posted by Tom Allison <to...@tacocat.net>.
Theo Van Dinter wrote:
> (I assume this message was also supposed to go to the users list since there
> was nothing private in it, so cc'ing there.)
> 
> On Fri, Jan 19, 2007 at 03:33:27PM -0500, tom@tacocat.net wrote:
>> The thought I was struggling with is that in the MTA the content_filter
>> is told who the email is going to via RCPT TO:.  So wouldn't it be
>> possible to loop through each address and perform the spam filtering? 
> 
> No.  First, the MTA won't loop over envelope recipients so you can't do that.
> Second, the envelope recipient may not be the final user that the mail is
> going to.  Third, there is only a single message, so even if you could scan it
> multiple times what markup would you use in the message?
> 
> However, running at delivery time via the MDA does exactly what you want and
> solves the above problems. :)
> 
>> Might be timely and expensive to do so but ..
> 
> Yep.  That's the main trade-off between site-wide and per-user scanning.
> Resource usage versus personalization.
> 
>> I guess spampd as it stands wouldn't do this.
>> But could it serve as a starting point to do this?
> 
> What you'd have to do is take an input message, then for each recipient scan
> and inject the result with one message per recipient.  But that's a horribly
> inefficient way to do the same thing as running from the MDA.
> 

Agreed.  But I'm trying to do all of this from a mail relay and not at
the insertion point to the MDA, which I guess means a lot of compromises.

Re: spampd

Posted by Theo Van Dinter <fe...@kluge.net>.
(I assume this message was also supposed to go to the users list since there
was nothing private in it, so cc'ing there.)

On Fri, Jan 19, 2007 at 03:33:27PM -0500, tom@tacocat.net wrote:
> The thought I was struggling with is that in the MTA the content_filter
> is told who the email is going to via RCPT TO:.  So wouldn't it be
> possible to loop through each address and perform the spam filtering? 

No.  First, the MTA won't loop over envelope recipients so you can't do that.
Second, the envelope recipient may not be the final user that the mail is
going to.  Third, there is only a single message, so even if you could scan it
multiple times what markup would you use in the message?

However, running at delivery time via the MDA does exactly what you want and
solves the above problems. :)

> Might be timely and expensive to do so but ..

Yep.  That's the main trade-off between site-wide and per-user scanning.
Resource usage versus personalization.

> I guess spampd as it stands wouldn't do this.
> But could it serve as a starting point to do this?

What you'd have to do is take an input message, then for each recipient scan
and inject the result with one message per recipient.  But that's a horribly
inefficient way to do the same thing as running from the MDA.
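
To make the cost concrete, here is a minimal sketch of that per-recipient
fan-out in Python.  Nothing in it is spampd or SpamAssassin code: fan_out,
the scan callback, the X-Spam-Score header and the 5.0 threshold are all
assumptions made up for illustration, and re-injecting each copy into the
MTA is left out entirely.

```python
from email import message_from_string
from email.message import Message
from typing import Callable, Iterator, List, Tuple


def fan_out(
    raw: str,
    recipients: List[str],
    scan: Callable[[Message, str], float],  # hypothetical per-user scorer
    threshold: float = 5.0,                 # assumed spam threshold
) -> Iterator[Tuple[str, Message]]:
    """Yield one independently marked-up copy of the message per recipient."""
    for rcpt in recipients:
        # Parse a fresh copy each time so one user's markup never leaks
        # into another user's copy.
        msg = message_from_string(raw)
        score = scan(msg, rcpt)
        msg["X-Spam-Flag"] = "YES" if score >= threshold else "NO"
        msg["X-Spam-Score"] = f"{score:.1f}"
        yield rcpt, msg
```

Note that the message is parsed and scanned once per recipient, which is
where the inefficiency comes from, and the envelope recipient passed to
scan still may not be the final user.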

-- 
Randomly Selected Tagline:
"I like to water my plants with ice cubes just to tease them."
         - Bob Lazarus

Re: spampd

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Jan 19, 2007 at 03:09:06PM -0500, tom@tacocat.net wrote:
> I was looking at using spampd as my content_filter in Postfix and found
> a comment in the Debian package description saying that it is not
> suitable for use with a per-user Bayesian filtering configuration.
> 
> I wanted to confirm whether this is still the case.

If you're scanning in the MTA, you can't have per-user configs or DBs.
It doesn't matter whether it's spampd, MailScanner, amavis, etc.

> I eventually intend on setting up a database backend for the Bayesian
> wordlists that are stored per-user and am unclear as to what the issue
> might be preventing spampd from being suitable for this application.

Two things which are related:  1) SA can't properly know who to scan the
message for and therefore can't know which DB to use, because 2) the message
can have multiple recipients or it may be relayed to another machine or ...

In short, if you want per-user DBs, you need to run at delivery time in the
MDA.

> I'm also curious if anyone can provide some vague idea of how long it
> takes to process emails?  I know it varies on a million conditions but
> I'm asking if it's closer to 1 second, 10 seconds, 100 seconds, 1000
> seconds.  Anything like that would be fine.

In a quick grep through my maillog, the top 5 times (in seconds) are as
follows, out of 62926 total messages scanned:

  count  seconds  share
  21976     2      35%
  11029     3      17%
   8625     4      14%
   6103     1      10%
   5073     5       8%

So a majority (62%) is ~3s or less, and 84% is 5s or less.
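
For anyone who wants to check the arithmetic, a short Python sketch using
only the counts listed above reproduces those two figures.  The helper
name share_within is just an illustrative choice, and the calculation
assumes no messages were logged at under 1 second.

```python
# Messages per whole-second scan time, taken from the maillog counts above.
total = 62926
counts = {1: 6103, 2: 21976, 3: 11029, 4: 8625, 5: 5073}


def share_within(limit: int) -> float:
    """Percentage of all scanned messages that took `limit` seconds or less."""
    scanned = sum(n for secs, n in counts.items() if secs <= limit)
    return 100.0 * scanned / total


print(round(share_within(3)))  # 62
print(round(share_within(5)))  # 84
```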

-- 
Randomly Selected Tagline:
"It's God.  No, not Richard Stallman, or Linus Torvalds, but God."
 (By Matt Welsh)