Posted to users@spamassassin.apache.org by btb <li...@bitrate.net> on 2014/08/27 23:06:10 UTC
sanitizing/normalizing messages for feeding sa-learn
hi-
we have a system [zimbra] where users can select a message in the mua
interface and click a spam or not spam button. this generates a message
[containing the selected message] which is ultimately delivered to a
mailbox. i intend on retrieving these messages via imap and feeding
sa-learn, but they've been a bit adulterated by the time they're
retrieved, and i believe some cleanup is probably necessary prior to
feeding sa-learn.
here are two samples:
http://dpaste.com/0B6S3FN.txt [claimed to be spam]
http://dpaste.com/3ZZ733Z.txt [claimed to be not spam]
the original message is encapsulated as an attachment, so i was planning
on extracting this and discarding the rest of the message - unless
sa-learn is magical enough to handle this?
aside from that, i've read
https://wiki.apache.org/spamassassin/BayesInSpamAssassin and man 1
sa-learn about spamassassin markup/headers, but would appreciate any
feedback for the above samples that might be pertinent - particular
headers that i may not have considered removing, etc.
thanks
-ben
Re: sanitizing/normalizing messages for feeding sa-learn
Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
On 27.08.14 17:06, btb wrote:
>we have a system [zimbra] where users can select a message in the mua
>interface and click a spam or not spam button. this generates a
>message [containing the selected message] which is ultimately
>delivered to a mailbox. i intend on retrieving these messages via
>imap and feeding sa-learn, but they've been a bit adulterated by the
>time they're retrieved, and i believe some cleanup is probably
>necessary prior to feeding sa-learn.
That should not be necessary. Hopefully Zimbra does not alter messages as
badly as Outlook/Exchange does. (What can I tell you? I had been trying to
block spam with a specific address in From:, and only after I blocked it
by Subject did I find out that the real From: was very different.)
>here are two samples:
>
>http://dpaste.com/0B6S3FN.txt [claimed to be spam]
>http://dpaste.com/3ZZ733Z.txt [claimed to be not spam]
>
>the original message is encapsulated as an attachment, so i was
>planning on extracting this and discarding the rest of the message -
>unless sa-learn is magical enough to handle this?
It is not, but extracting the original message should be enough.
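
A minimal sketch of that extraction step, in Python with only the standard
library. It assumes the report carries the original as a message/rfc822
MIME part, which the thread does not confirm in detail; the function name
extract_original_messages is made up for illustration. The re-serialized
bytes may differ slightly from the original (line endings, for instance).

```python
from email import message_from_bytes


def extract_original_messages(report_bytes):
    """Return the raw bytes of each message/rfc822 part found in a report.

    Assumption: the reporting system attaches the original message as a
    message/rfc822 part. Nested messages inside an original are left alone.
    """
    report = message_from_bytes(report_bytes)
    originals = []

    def visit(part):
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # The parser stores the embedded message as a one-item list.
            if isinstance(payload, list) and payload:
                originals.append(payload[0].as_bytes())
            return  # do not descend into the original message itself
        if part.is_multipart():
            for child in part.get_payload():
                visit(child)

    visit(report)
    return originals
```

Each returned item is one complete message, wrapper discarded, ready to be
written to a file or piped to a learner.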
>aside from that, i've read
>https://wiki.apache.org/spamassassin/BayesInSpamAssassin and man 1
>sa-learn about spamassassin markup/headers, but would appreciate any
>feedback for the above samples that might be pertinent - particular
>headers that i may not have considered removing, etc.
I would not remove any headers; SA should handle them properly.
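
In line with that advice, here is a rough Python sketch that hands an
original message to sa-learn without touching its headers. It relies on
sa-learn's --spam and --ham options and on sa-learn reading a message from
standard input when no file is given. The names sa_learn_command and
feed_sa_learn, and the runner parameter, are invented for this example.

```python
import subprocess


def sa_learn_command(is_spam):
    """Build the sa-learn argument list for a spam or ham (not spam) message."""
    return ["sa-learn", "--spam" if is_spam else "--ham"]


def feed_sa_learn(raw_message, is_spam, runner=subprocess.run):
    """Pipe one unmodified message (bytes) to sa-learn on standard input.

    `runner` defaults to subprocess.run; it is a parameter only so the
    call can be exercised without sa-learn installed.
    """
    return runner(sa_learn_command(is_spam), input=raw_message, check=True)
```

With check=True, a non-zero exit status from sa-learn raises
subprocess.CalledProcessError instead of being silently ignored.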
--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Your mouse has moved. Windows NT will now restart for changes to take
effect. [OK]
Re: sanitizing/normalizing messages for feeding sa-learn
Posted by li...@bitrate.net.
On Aug 27, 2014, at 18.13, Quanah Gibson-Mount <qu...@zimbra.com> wrote:
> --On Wednesday, August 27, 2014 6:06 PM -0400 btb <li...@bitrate.net> wrote:
>
>> hi-
>>
>> we have a system [zimbra] where users can select a message in the mua
>> interface and click a spam or not spam button. this generates a message
>> [containing the selected message] which is ultimately delivered to a
>> mailbox. i intend on retrieving these messages via imap and feeding
>> sa-learn, but they've been a bit adulterated by the time they're
>> retrieved, and i believe some cleanup is probably necessary prior to
>> feeding sa-learn.
>
> That seems rather convoluted, given that Zimbra already trains its SA database automatically on a nightly basis based on the messages users submit via marking things as Spam. Are you running your own SA outside of Zimbra?
yes, our mx/mta/msa/content filtering infrastructure is completely separate from zimbra.
-ben
Re: sanitizing/normalizing messages for feeding sa-learn
Posted by Quanah Gibson-Mount <qu...@zimbra.com>.
--On Wednesday, August 27, 2014 6:06 PM -0400 btb
<li...@bitrate.net> wrote:
> hi-
>
> we have a system [zimbra] where users can select a message in the mua
> interface and click a spam or not spam button. this generates a message
> [containing the selected message] which is ultimately delivered to a
> mailbox. i intend on retrieving these messages via imap and feeding
> sa-learn, but they've been a bit adulterated by the time they're
> retrieved, and i believe some cleanup is probably necessary prior to
> feeding sa-learn.
That seems rather convoluted, given that Zimbra already trains its SA
database automatically on a nightly basis based on the messages users submit
via marking things as Spam. Are you running your own SA outside of Zimbra?
--Quanah
--
Quanah Gibson-Mount
Server Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration