Posted to users@spamassassin.apache.org by Dan Barker <db...@visioncomm.net> on 2004/10/27 19:59:36 UTC

SA-Learn input format?

I've been running SA for about a week now, and need to sa-(un)learn the FPs.

My system is Windoze/IMail (5sp4/8.13) and the harry and susan (shouldn't
call them Ham and Spam, should I) folders contain all mis-identified email
in one giant flat file each.

Does this work?

Must I bust them up into separate emails before calling sa-learn?

The doc mentions the folders but says diddly-squat/infinity about the
contents of those folders.

Dan Barker

Format of a big flat file:

>From <db...@visioncomm.net> Thu Oct 21 17:17:58 2004
Received: from dan [172.27.0.30] by visioncomm.net with ESMTP
  (SMTPD32-8.13) id A7823A3001E; Thu, 21 Oct 2004 17:17:54 -0400
From: "Dan Barker" <db...@visioncomm.net>
To: <su...@visioncomm.net>
... rest of headers

<HTML>
<TITLE></TITLE>
<BODY >
... rest of message

>From <db...@visioncomm.net> Thu Oct 21 17:44:42 2004
Received: from dan [172.27.0.30] by visioncomm.net with ESMTP
  (SMTPD32-8.13) id ADCA1BD007C; Thu, 21 Oct 2004 17:44:42 -0400
From: "Dan Barker" <db...@visioncomm.net>
To: <su...@visioncomm.net>
Subject: Dbarker, Served in the MlLlTARY?
... rest of headers

This is a multi-part message in MIME format.

------=_NextPart_000_03B4_01C4B795.A68C01E0
Content-Type: text/plain;
        charset="us-ascii"
Content-Transfer-Encoding: 7bit
... rest of message


...
... for every email in the "box".



The Headers stop and Body begins on the first blank line.

I haven't figured out how the body ends yet. It appears to be "From <" in
column 1". Yeah, that's it. I just ran a test with "From <" in column 1, and
the email is stored with ">From <" instead. So, a splitter will be trivial
to write, but must I?
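
For reference, the splitter described above might look roughly like the
following Python sketch. It is only an illustration, not anything IMail or
SpamAssassin ships: the function name split_messages and the
separator_prefix parameter are made up here, and the default prefix assumes
each message starts with a line beginning "From <" in column 1. If the
separators in your file are written differently, pass a different prefix.

```python
def split_messages(text, separator_prefix="From <"):
    """Split one flat mailbox string into a list of message strings.

    Assumption: a new message starts at every line that begins with
    separator_prefix in column 1. The separator line itself is dropped.
    Anything before the first separator is ignored.
    """
    messages = []
    current = None  # lines of the message being collected, or None before the first separator
    for line in text.splitlines(keepends=True):
        if line.startswith(separator_prefix):
            # A separator closes the previous message (if any) and opens a new one.
            if current is not None:
                messages.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
    # Flush the last message in the file.
    if current is not None:
        messages.append("".join(current))
    return messages
```

Each returned string holds the headers, the first blank line, and the body
of one message. A body line stored as ">From <" is left alone, because it
does not begin with the separator prefix.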




Re: SA-Learn input format?

Posted by Theo Van Dinter <fe...@kluge.net>.
On Wed, Oct 27, 2004 at 01:59:36PM -0400, Dan Barker wrote:
> My system is Windoze/IMail (5sp4/8.13) and the harry and susan (shouldn't
> call them Ham and Spam, should I) folders contain all mis-identified email
> in one giant flat file each.
> 
> Does this work?

If the file format is correct, sure.

> The doc mentions the folders but says diddly-squat/infinity about the
> contents of those folders.

Well, it does actually.  sa-learn supports mbox and mbx files.

> Format of a big flat file:
> 
> >From <db...@visioncomm.net> Thu Oct 21 17:17:58 2004
[...]
> >From <db...@visioncomm.net> Thu Oct 21 17:44:42 2004

This is almost mbox, except that the mbox separator is escaped, which won't work.

> the email is stored with ">From <" instead. So, a splitter will be trivial
> to write, but must I?

You could rewrite the ">From <...>" to be "From <...>", and then it's
apparently just an mbox file. :)
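
A minimal sketch of that rewrite in Python, under the assumption that every
line beginning with ">From <" in column 1 is an escaped separator. The name
unescape_separators is hypothetical. Note that a body line that was
legitimately escaped would be un-escaped as well, so check the output on a
copy first.

```python
import re

# Matches ">From <" only at the very start of a line (column 1).
ESCAPED_SEPARATOR = re.compile(r"^>From <", re.MULTILINE)


def unescape_separators(text):
    """Turn each leading '>From <' back into 'From <', leaving other text untouched."""
    return ESCAPED_SEPARATOR.sub("From <", text)
```

The rewritten text can then be saved to a new file and handed to sa-learn
as an mbox; see the sa-learn documentation for the mbox option in your
version.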

-- 
Randomly Generated Tagline:
"When all else fails, kick with lunar boot."      - James Burke