Posted to users@spamassassin.apache.org by btb <li...@bitrate.net> on 2014/08/27 23:06:10 UTC

sanitizing/normalizing messages for feeding sa-learn

hi-

we have a system [zimbra] where users can select a message in the mua 
interface and click a spam or not spam button.  this generates a message 
[containing the selected message] which is ultimately delivered to a 
mailbox.  i intend on retrieving these messages via imap and feeding 
sa-learn, but they've been a bit adulterated by the time they're 
retrieved, and i believe some cleanup is probably necessary prior to 
feeding sa-learn.

here are two samples:

http://dpaste.com/0B6S3FN.txt [claimed to be spam]
http://dpaste.com/3ZZ733Z.txt [claimed to be not spam]

the original message is encapsulated as an attachment, so i was planning 
on extracting this and discarding the rest of the message - unless 
sa-learn is magical enough to handle this?

aside from that, i've read 
https://wiki.apache.org/spamassassin/BayesInSpamAssassin and man 1 
sa-learn about spamassassin markup/headers, but would appreciate any 
feedback for the above samples that might be pertinent - particular 
headers that i may not have considered removing, etc.

thanks
-ben

Re: sanitizing/normalizing messages for feeding sa-learn

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 27.08.14 17:06, btb wrote:
>we have a system [zimbra] where users can select a message in the mua 
>interface and click a spam or not spam button.  this generates a 
>message [containing the selected message] which is ultimately 
>delivered to a mailbox.  i intend on retrieving these messages via 
>imap and feeding sa-learn, but they've been a bit adulterated by the 
>time they're retrieved, and i believe some cleanup is probably 
>necessary prior to feeding sa-learn.

That should not be necessary. Hopefully Zimbra does not alter messages as
badly as Outlook/Exchange does. (What should I tell you? I've been trying to
block spam with a specific address in From: ... after I blocked according to
the Subject, I found out that the real From: is very different.)

>here are two samples:
>
>http://dpaste.com/0B6S3FN.txt [claimed to be spam]
>http://dpaste.com/3ZZ733Z.txt [claimed to be not spam]
>
>the original message is encapsulated as an attachment, so i was 
>planning on extracting this and discarding the rest of the message - 
>unless sa-learn is magical enough to handle this?

It is not, but extracting the original message should be enough.
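
A minimal sketch of that extraction step, assuming the report wraps the
original as a message/rfc822 MIME part (the samples may differ, so check
them first). The function name extract_original is hypothetical; only
Python's standard email package is used.

```python
import email


def extract_original(wrapper_bytes):
    """Return the first message/rfc822 attachment as bytes, or None.

    Assumption: the report message carries the original as a
    message/rfc822 part. Other encapsulations are not handled here.
    """
    wrapper = email.message_from_bytes(wrapper_bytes)
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # For message/rfc822 parts the payload is a list holding one
            # parsed Message object: the encapsulated original.
            inner = part.get_payload(0)
            # Re-serialising may not be byte-identical to what was sent,
            # but the headers and body of the original are kept.
            return inner.as_bytes()
    return None
```

The returned bytes could be written to a file and passed to
sa-learn --spam or sa-learn --ham, depending on which button the user
clicked.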

>aside from that, i've read 
>https://wiki.apache.org/spamassassin/BayesInSpamAssassin and man 1 
>sa-learn about spamassassin markup/headers, but would appreciate any 
>feedback for the above samples that might be pertinent - particular 
>headers that i may not have considered removing, etc.

I would not remove any headers; SA should handle that properly.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Your mouse has moved. Windows NT will now restart for changes to take
effect. [OK]

Re: sanitizing/normalizing messages for feeding sa-learn

Posted by li...@bitrate.net.

On Aug 27, 2014, at 18.13, Quanah Gibson-Mount <qu...@zimbra.com> wrote:

> --On Wednesday, August 27, 2014 6:06 PM -0400 btb <li...@bitrate.net> wrote:
> 
>> hi-
>> 
>> we have a system [zimbra] where users can select a message in the mua
>> interface and click a spam or not spam button.  this generates a message
>> [containing the selected message] which is ultimately delivered to a
>> mailbox.  i intend on retrieving these messages via imap and feeding
>> sa-learn, but they've been a bit adulterated by the time they're
>> retrieved, and i believe some cleanup is probably necessary prior to
>> feeding sa-learn.
> 
> That seems rather convoluted, given that Zimbra already trains its SA database automatically on a nightly basis based on the messages users submit by marking things as Spam.  Are you running your own SA outside of Zimbra?

yes, our mx/mta/msa/content filtering infrastructure is completely separate from zimbra.

-ben

Re: sanitizing/normalizing messages for feeding sa-learn

Posted by Quanah Gibson-Mount <qu...@zimbra.com>.

--On Wednesday, August 27, 2014 6:06 PM -0400 btb 
<li...@bitrate.net> wrote:

> hi-
>
> we have a system [zimbra] where users can select a message in the mua
> interface and click a spam or not spam button.  this generates a message
> [containing the selected message] which is ultimately delivered to a
> mailbox.  i intend on retrieving these messages via imap and feeding
> sa-learn, but they've been a bit adulterated by the time they're
> retrieved, and i believe some cleanup is probably necessary prior to
> feeding sa-learn.

That seems rather convoluted, given that Zimbra already trains its SA 
database automatically on a nightly basis based on the messages users submit 
by marking things as Spam.  Are you running your own SA outside of Zimbra?

--Quanah


--

Quanah Gibson-Mount
Server Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration