Posted to users@spamassassin.apache.org by Herold Heiko <He...@previnet.it> on 2005/04/11 18:30:21 UTC

Gateways, analyze first, insert into bayes later ?

Newbie Alert - New to Spamassassin. Pondering enhancement to my current
basic setup, which is a filter gateway in front of MS exchange.
Filter gw is amavisd-new + dual-sendmail-setup + clamav+spamassassin 3.02. 

I'm looking for a way to feed sorted spam/ham info back into the spamassassin
bayes database. Skimming through the list archives, I found people discussing
several of the possibilities I had been thinking about, too:

- feed the spam/ham msgs back with a "forward". Problem: outlook munges
vital headers, attachments are possibly in different encoding, since
exchange decoded the whole body and attachments, and re-encodes them again
on forward - after all internally exchange isn't based on smtp (at least not
exch55 which we are still using).

- feed msgs back by having the users copy/paste the headers into the
"forward" email, extract and reconstruct somehow. Problem: cumbersome
(management would certainly yell), still the body/attachment encoding
problem.

- Have users sort Spam (and wrongly marked Ham) into separate folders, attach
with CDO or OLE automation of outlook. Users are happy, but the whole
message would need reconstruction based on original headers, body and
attachments, losing valuable information.

- Have users sort Spam and Ham into separate folders, extract with IMAP. Users
are happy, headers should be fine, but I think the original encoding used
for body and attachments is still lost: what we feed back to sa-learn is a
mail freshly re-encoded by exchange.

Can anybody with more knowledge of the workings of SpamAssassin tell me if
the loss of the original encoding of body and attachments is a VERY BAD
THING ?

If it is, I was thinking: SpamAssassin already analysed all those
(inbound) messages when they were first delivered.
Is it possible (are there any hooks) to extract the logical information
from that analysis ?
I didn't yet find anything relevant in the Mail::SpamAssassin pod, I suppose
I'll have to check the gory details of the learn() and parse() methods.
Possibly the returned Mail::SpamAssassin::PerMsgLearner object will be
useful.

So we could save that information (for some time... say a couple of weeks,
depends on size and so on) using the message-id as a key.
Later, instead of sa-learn --spam <path_to_spam_msg>, we could retrieve
that info (extract the msg-id from the headers, retrieve the analysis data
from the db) and feed it back.
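
A minimal Python sketch of the "save now, look up by message-id later" idea described above. It is not SpamAssassin code: the table layout, the function names and the two-week retention constant are all assumptions for illustration. The stored data is treated as an opaque blob, because whether SpamAssassin can export its analysis is exactly the open question here. Storing the raw message itself would work the same way.

```python
import sqlite3
import time
from email import message_from_bytes

# "A couple of weeks" from the post; the exact value is an assumption.
RETENTION_SECONDS = 14 * 24 * 3600


def open_store(path=":memory:"):
    """Open (or create) the hypothetical analysis store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS analysis ("
        " msgid TEXT PRIMARY KEY, saved_at REAL NOT NULL, data BLOB NOT NULL)"
    )
    return db


def message_id(raw_message):
    """Return the Message-ID header of a raw message, or None if absent."""
    value = message_from_bytes(raw_message).get("Message-ID")
    return value.strip() if value else None


def save_analysis(db, raw_message, data, now=None):
    """Store an opaque blob under the message's Message-ID."""
    msgid = message_id(raw_message)
    if msgid is None:
        return False  # nothing to key on
    saved_at = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO analysis VALUES (?, ?, ?)",
        (msgid, saved_at, data),
    )
    db.commit()
    return True


def load_analysis(db, raw_message):
    """Look up the blob saved for the message with the same Message-ID."""
    msgid = message_id(raw_message)
    row = db.execute(
        "SELECT data FROM analysis WHERE msgid = ?", (msgid,)
    ).fetchone()
    return row[0] if row else None


def purge_old(db, now=None):
    """Delete entries older than the retention window; return the count."""
    cutoff = (time.time() if now is None else now) - RETENTION_SECONDS
    cur = db.execute("DELETE FROM analysis WHERE saved_at < ?", (cutoff,))
    db.commit()
    return cur.rowcount
```

The lookup only works if the Message-ID header survives the round trip through the mail store, which would have to be checked for the Exchange version in use.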

Anybody with better knowledge of the internal workings of SpamAssassin could
tell me
- if this is even necessary / useful ? After all I AM a newbie in this area,
maybe there is some other easy way I didn't spot yet, OR the loss of the
original encoding is not so important

- if this is already possible

- if not, if this could be possible with the current codebase. I suppose so,
basically in learn() locate the necessary data structures, encode in
standard and portable format, save it somewhere. Reverse at inserting stage.

- any pointers on where to start implementing the hooks, or pitfalls to avoid

- if something similar possibly is already wip somewhere 

Thanks

Heiko Herold

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold Heiko.Herold@previnet.it Sistemisti@previnet.it
-- +39-041-5907073 ph
-- +39-041-5907472 fax

Re: Gateways, analyze first, insert into bayes later ?

Posted by Matt Yackley <sa...@yackley.org>.
Hi Herold,

Are you using a sitewide bayes DB?  This may affect your choice of solutions. I'm
running sitewide, so my method may not work if you are using separate DBs for all
your users....

Herold Heiko said:
> Newbie Alert - New to Spamassassin. Pondering enhancement to my current
> basic setup, which is a filter gateway in front of MS exchange.
> Filter gw is amavisd-new + dual-sendmail-setup + clamav+spamassassin 3.02.
>
> I'm looking for a way to feed sorted spam/ham info back into the spamassassin
> bayes database. Skimming through the list archives, I found people discussing
> several of the possibilities I had been thinking about, too:
>
> - feed the spam/ham msgs back with a "forward".

If you have to go with a "forward" option it would be best to "forward as
attachment", which would preserve the headers, but then creates the issue of
"unwrapping" the attached message. I have seen this mentioned many times, but have
never seen a script to do this :(
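
For what it's worth, the "unwrapping" step can be sketched with Python's standard email package. This is only an illustration under assumptions: it expects the forward to arrive as a MIME message with the original attached as a message/rfc822 part, and the function names are made up here. A client that attaches the original in some other form (for example a proprietary container) would need different handling.

```python
from email import message_from_bytes


def _collect(part, found):
    """Gather attached messages without descending into them."""
    if part.get_content_type() == "message/rfc822":
        payload = part.get_payload()
        # For message/rfc822 parts the payload is a list holding one Message.
        if isinstance(payload, list) and payload:
            found.append(payload[0].as_bytes())
        return
    if part.is_multipart():
        for sub in part.get_payload():
            _collect(sub, found)


def unwrap_attached_messages(raw_wrapper):
    """Return each message attached to the wrapper mail, as raw bytes."""
    found = []
    _collect(message_from_bytes(raw_wrapper), found)
    return found
```

Each returned item carries the attached message's own headers and body, so it could be written to a file and handed to sa-learn instead of the wrapper.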

>
> - Have users sort Spam (and wrongly marked Ham) into separate folders, attach
> with CDO or OLE automation of outlook. Users are happy, but the whole
> message would need reconstruction based on original headers, body and
> attachments, losing valuable information.

I use public folders for message submission: users can see the folders and create
messages in them, but can't view or change the contents.  At first we had the users
drag and drop messages into these folders, but navigation is a bit of a pain.
Instead I talked with a dev here at work and he wrote a small plugin for
Outlook that adds a "Learn as spam" and a "Learn as Ham" button to the main toolbar
in Outlook.  The spam button "moves" a message to the spam folder and the "ham"
button copies the message.  It's quick and easy for the users and has been working
well for us; now I just need the time to document it a bit and release it for others
to use.  Now on to the other issues... :)

> - Have users sort Spam and Ham into separate folders, extract with IMAP. Users
> are happy, headers should be fine, but I think the original encoding used
> for body and attachments is still lost: what we feed back to sa-learn is a
> mail freshly re-encoded by exchange.

Are you thinking of having the users "push" the messages to the relay server, or of
pulling the messages out of Exchange from the relay server?

Extracting messages from public folders via IMAP is somewhat broken in Ex 2000 &
2003, not sure about 5.5.  It tends to drop all headers except for received, date
and subject, and inserts some of its own.  This isn't good, but my bayes still works
pretty darn well.  (I have a ticket open with MS about this)
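
To see what a given extraction path does to the headers, one could compare a copy saved at the gateway with the copy pulled back out of the mail store. A small Python sketch follows; the function names are made up, and how the two copies are obtained is left open.

```python
from email import message_from_bytes


def header_names(raw_message):
    """Return the set of lower-cased header names present in a raw message."""
    return {name.lower() for name in message_from_bytes(raw_message).keys()}


def header_diff(original, retrieved):
    """Report which headers were dropped or added between two copies."""
    before = header_names(original)
    after = header_names(retrieved)
    return {
        "dropped": sorted(before - after),
        "added": sorted(after - before),
    }
```

If Message-ID shows up in the "dropped" list, any scheme that looks up the original by message ID would need some other key for messages extracted this way.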

> Can anybody with more knowledge of the workings of SpamAssassin tell me if
> the loss of the original encoding of body and attachments is a VERY BAD
> THING ?

I don't believe that bayes will process attachments in 3.x and above. The encoding
may change somewhat, but hopefully the majority of messages will be ok.  So I would
say it's a bad or a not-so-good thing, but not a very bad thing overall.

> If it is, I was thinking: SpamAssassin already analysed all those
> (inbound) messages when they were first delivered.
snip
>
> So we could save that information (for some time... say a couple of weeks,
> depends on size and so on) using the message-id as a key.
> Later, instead of sa-learn --spam <path_to_spam_msg>, we could retrieve
> that info (extract the msg-id from the headers, retrieve the analysis data
> from the db) and feed it back.

This is something that I have talked about with the dev at work: perhaps use amavis
or postfix (in my case) to save a copy of all messages, then write something to pull
the msg ID out of submitted messages and then pull the "original" out of the "raw
message store" on the relay server.  If MS can't fix my IMAP header issue, then we
may look at trying to write something.
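
A rough Python sketch of that lookup, assuming the "raw message store" is simply a directory with one file per saved message. The directory layout and the function names are invented for illustration. The only SpamAssassin interface it relies on is the sa-learn command line with its --spam and --ham options.

```python
import subprocess
from email import message_from_bytes
from pathlib import Path


def index_store(store_dir):
    """Map Message-ID -> path for every file in the raw message store."""
    index = {}
    for path in sorted(Path(store_dir).iterdir()):
        if not path.is_file():
            continue
        msgid = message_from_bytes(path.read_bytes()).get("Message-ID")
        if msgid:
            index[msgid.strip()] = path
    return index


def find_original(index, submitted_raw):
    """Return the stored original for a user-submitted copy, or None."""
    msgid = message_from_bytes(submitted_raw).get("Message-ID")
    return index.get(msgid.strip()) if msgid else None


def sa_learn_command(path, is_spam):
    """Build the sa-learn invocation for one stored message."""
    return ["sa-learn", "--spam" if is_spam else "--ham", str(path)]


def learn(path, is_spam):
    """Run sa-learn on the stored original; raises if sa-learn fails."""
    subprocess.run(sa_learn_command(path, is_spam), check=True)
```

Re-reading every file to build the index would be slow for a busy relay; recording the message ID at save time is the obvious alternative.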

> Anybody with better knowledge of the internal workings of SpamAssassin could
> tell me
> - if this is even necessary / useful ? After all I AM a newbie in this area,
> maybe there is some other easy way I didn't spot yet, OR the loss of the
> original encoding is not so important

I'll have to let someone else who knows more answer that one.


> Thanks
>
> Heiko Herold

If you want to go the public folder route, be sure to check out Nick Burch's
power-imap-sa-learn script. http://tirian.magd.ox.ac.uk/~nick/code/

Cheers,
matt