Posted to users@spamassassin.apache.org by Ed Flecko <ed...@gmail.com> on 2012/11/27 00:01:51 UTC

Provide sa-learn with a CSV file of spam and ham?

Hi folks,
I'm running SpamAssassin version 3.3.2 (running on Perl version
5.14.2) on FreeBSD 9.0.

I've exported a bunch of spam and ham messages from my Barracuda 400.

I have an Excel .csv file of about 2500 spam messages and 2500 ham
messages, and I'm wondering if I can supply it as a parameter to
sa-learn. I've looked at the documentation
(http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html) and I
see that you can pass a file as a parameter, but I'm not clear on how
you'd do that or what format the file needs to be in. CAN it be a
.csv, or should it be something else?

I'm new to SpamAssassin, but (for those of you more familiar with the
product) "teaching" SpamAssassin is TYPICALLY the first thing one
would do before deploying it in a production environment, isn't
it?

Thank you,

Ed

Re: Provide sa-learn with a CSV file of spam and ham?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 26 Nov 2012, John Hardin wrote:

> On Mon, 26 Nov 2012, Ed Flecko wrote:
>
>>  Hi folks,
>>  I'm running SpamAssassin version 3.3.2 (running on Perl version
>>  5.14.2) on FreeBSD 9.0.
>>
>>  I've exported a bunch of spam and ham messages from my Barracuda 400.
>
> What format did the Barracuda export the messages in? It might be possible to 
> directly feed that to sa-learn if it exported them in one of the "standard" 
> mailbox formats.
>
>>  I have an Excel .csv file of about 2500 spam messages and 2500 ham
>>  messages, and I'm wondering if I can supply those as a parameter to
>>  sa-learn? I've looked at the documentation
>>  (http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html) and I
>>  see that you can pass the file as a parameter, but I'm not clear how
>>  you'd do that and in what format the file needs to be? CAN it be a
>>  .csv or should it be something else?
>
> sa-learn expects either Berkeley-style mailbox files (i.e. RFC-822-format 
> messages separated by "From {stuff about sender}", or mbox 
> one-message-per-file format.

Oops. That should read "maildir one-message-per-file format." Sorry.

> If your mailboxes aren't hosted on Windows, then 
> take a look at your inbox file in a text editor to get an idea of the file 
> format. (try "vi $MAIL" if you use vi)
>
>>  I'm new to spamassassin, but (for those of you more familiar with the
>>  product), "teaching" spamassassin is TYPICALLY the first thing one
>>  would do before deploying it in a production environment, wouldn't
>>  you?
>
> Not necessarily the first thing, but certainly done early on. SA does fairly 
> well without Bayes, especially if you have DNSBLs and URIBLs enabled, so you 
> don't necessarily need to get it trained before turning it on in production. 
> You can cut down on spam while getting it trained up.
>
> You should turn off autolearn until you've trained it and are sure bayes is 
> giving good results.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  29 days until Christmas

Re: Provide sa-learn with a CSV file of spam and ham?

Posted by John Hardin <jh...@impsec.org>.
On Mon, 26 Nov 2012, Ed Flecko wrote:

> Hi folks,
> I'm running SpamAssassin version 3.3.2 (running on Perl version
> 5.14.2) on FreeBSD 9.0.
>
> I've exported a bunch of spam and ham messages from my Barracuda 400.

What format did the Barracuda export the messages in? It might be possible 
to directly feed that to sa-learn if it exported them in one of the 
"standard" mailbox formats.

> I have an Excel .csv file of about 2500 spam messages and 2500 ham
> messages, and I'm wondering if I can supply those as a parameter to
> sa-learn? I've looked at the documentation
> (http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html) and I
> see that you can pass the file as a parameter, but I'm not clear how
> you'd do that and in what format the file needs to be? CAN it be a
> .csv or should it be something else?

sa-learn expects either Berkeley-style mailbox files (i.e. RFC-822-format 
messages separated by "From {stuff about sender}" lines), or maildir 
one-message-per-file format. If your mailboxes aren't hosted on Windows, 
then take a look at your inbox file in a text editor to get an idea of the 
file format. (try "vi $MAIL" if you use vi)
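
To make the mbox layout concrete, here is a minimal Python sketch that writes two throwaway messages to an mbox file and prints the "From " separator lines that start each one. The addresses, subjects and file name are made up for illustration; only the standard-library mailbox module is used.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def write_sample_mbox(path):
    """Write two small, made-up messages to an mbox file at `path`."""
    box = mailbox.mbox(path)
    try:
        for n in (1, 2):
            msg = EmailMessage()
            msg["From"] = f"sender{n}@example.com"
            msg["To"] = "ed@example.com"
            msg["Subject"] = f"Sample message {n}"
            msg.set_content(f"Body of message {n}.\n")
            box.add(msg)
        box.flush()
    finally:
        box.close()


def separator_lines(path):
    """Return the 'From ' envelope lines that mark the start of each message."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if line.startswith("From ")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        write_sample_mbox(path)
        for line in separator_lines(path):
            print(line)
```

Each printed line begins with "From " followed by a sender and a timestamp. Note that the "From: " header inside a message is a different thing and is not a separator.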

> I'm new to spamassassin, but (for those of you more familiar with the
> product), "teaching" spamassassin is TYPICALLY the first thing one
> would do before deploying it in a production environment, wouldn't
> you?

Not necessarily the first thing, but certainly done early on. SA does 
fairly well without Bayes, especially if you have DNSBLs and URIBLs 
enabled, so you don't necessarily need to get it trained before turning it 
on in production. You can cut down on spam while getting it trained up.

You should turn off autolearn until you've trained it and are sure bayes 
is giving good results.
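
As a rough illustration of that advice, autolearning is controlled by the bayes_auto_learn setting in SpamAssassin's configuration. The file name below is an assumption; where local.cf lives varies by install.

```
# local.cf (location varies by install)

# Keep Bayes itself enabled so manual sa-learn training is used.
use_bayes 1

# Do not let SA train itself until the results look good.
bayes_auto_learn 0
```

Once you've trained it, "sa-learn --dump magic" reports how many spam and ham messages Bayes has learned.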


Re: Provide sa-learn with a CSV file of spam and ham?

Posted by da...@chaosreigns.com.
 --mbox                Input sources are in mbox format
 --mbx                 Input sources are in mbx format

--folders=filename, -f filename

    sa-learn will read in the list of folders from the specified file, one folder per line in the file. If the folder is prefixed with ham:type: or spam:type:, sa-learn will learn that folder appropriately, otherwise the folders will be assumed to be of the type specified by --ham or --spam.

    type above is optional, but is the same as the standard for ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not specified).

 - http://spamassassin.apache.org/full/3.3.x/doc/sa-learn.html
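
A small Python sketch of what such a folders file could look like, built from the "ham:type:" / "spam:type:" prefix described above. The paths are hypothetical and the helper name folder_lines is my own, not part of SpamAssassin.

```python
ALLOWED_LABELS = ("ham", "spam")
ALLOWED_TYPES = ("mbox", "mbx", "dir", "file", "detect")


def folder_lines(entries):
    """Format (label, kind, path) tuples as 'label:kind:path' lines."""
    lines = []
    for label, kind, path in entries:
        if label not in ALLOWED_LABELS:
            raise ValueError(f"label must be ham or spam, got {label!r}")
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown folder type {kind!r}")
        lines.append(f"{label}:{kind}:{path}")
    return lines


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    entries = [
        ("spam", "mbox", "/home/ed/export/spam.mbox"),
        ("ham", "mbox", "/home/ed/export/ham.mbox"),
    ]
    print("\n".join(folder_lines(entries)))
```

Saved to a file, one folder per line, that list is what --folders=filename expects according to the documentation quoted above.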

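So the input format can be mbox, mbx, dir (a directory with one message
per file, e.g. maildir), file, or detect. CSV is not among them, so the
export would have to be converted to one of these formats first.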
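
If the CSV really is all you have, a conversion along these lines might work. This is only a sketch: the column names ("from", "to", "subject", "body") are guesses, since the thread doesn't say what the Barracuda export contains, and messages rebuilt this way lack the original headers, so Bayes would learn less from them than from the raw mail.

```python
import csv
import mailbox
from email.message import EmailMessage

# Hypothetical CSV column -> header mapping; adjust to the real export.
HEADER_COLUMNS = (("From", "from"), ("To", "to"), ("Subject", "subject"))
BODY_COLUMN = "body"


def _clean(value):
    """Collapse whitespace so a CSV field is safe to use as a header value."""
    return " ".join((value or "").split())


def csv_to_mbox(csv_path, mbox_path):
    """Append one message per CSV row to an mbox file; return the row count."""
    count = 0
    box = mailbox.mbox(mbox_path)
    try:
        with open(csv_path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                msg = EmailMessage()
                for header, column in HEADER_COLUMNS:
                    value = _clean(row.get(column))
                    if value:
                        msg[header] = value
                msg.set_content(row.get(BODY_COLUMN) or "")
                box.add(msg)
                count += 1
        box.flush()
    finally:
        box.close()
    return count


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(csv_to_mbox("spam.csv", "spam.mbox"), "messages written")
```

The resulting file could then be fed to sa-learn with --spam --mbox (and a separate ham file with --ham --mbox).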


I'd guess a lot of people use spamassassin without bayes.


-- 
"Hermes will help you get your wagon unstuck, but only if you push on it."
- Greek Alphabet Oracle
http://www.ChaosReigns.com