Posted to users@spamassassin.apache.org by Matus UHLAR - fantomas <uh...@fantomas.sk> on 2022/10/11 09:38:18 UTC

RFH: using SOUGHT logic to combat phish

Hello,

I have a fairly large archive of phishing mail (bank and mail account 
phishes), in which many words and phrases repeat.

I was thinking about processing them manually and creating rules, but that 
would be a lot of work. 

I remember that the SOUGHT ruleset used to contain phrases that appeared 
repeatedly, so I'd like to use that approach, if possible.
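
For illustration, a SOUGHT-style rule is just an ordinary body rule keyed 
on one of those recurring phrases; the rule name, phrase and score below 
are made up by me, not actual SOUGHT output:

```
# Hypothetical SOUGHT-style body rule; name, phrase and score are
# illustrative only, not taken from the real ruleset.
body     SOUGHT_PHISH_EXAMPLE  /your account has been temporarily suspended/i
describe SOUGHT_PHISH_EXAMPLE  Phrase seen repeatedly in phish corpus
score    SOUGHT_PHISH_EXAMPLE  2.5
```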

So far I have found:
- a description of how it works: https://taint.org/2007/03/05/134447a.html
- scripts to search a corpus:
   https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus

which seem to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I found 
at: 
https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/
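
As I understand it, the core idea is to find phrases that recur across 
many distinct spam messages. A toy sketch of that principle (this is my 
own simplification, not the actual seek-phrases-in-corpus algorithm, 
which is far more sophisticated):

```python
# Toy sketch of the SOUGHT principle: count word n-grams across a
# spam corpus and keep those that recur in several distinct messages.
from collections import Counter
import re

def ngrams(text, n=4):
    # One occurrence per message: return a set, so a single long
    # mail cannot inflate a phrase's count.
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def candidate_phrases(messages, n=4, min_hits=3):
    counts = Counter()
    for body in messages:
        counts.update(ngrams(body, n))
    return [p for p, c in counts.items() if c >= min_hits]

spam = [
    "please verify your account immediately to avoid suspension",
    "you must verify your account immediately or lose access",
    "kindly verify your account immediately, says the bank",
]
print(candidate_phrases(spam))
# -> ['verify your account immediately']
```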


Are these still working, or are there newer versions?

Does anyone have hints on how to process a phish archive?

I mean, I could presumably weed out any repeated non-phish phrases 
manually to avoid FPs, or manually check which mail they hit, so I 
wouldn't need to keep much ham.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Due to unexpected conditions Windows 2000 will be released
in first quarter of year 1901

Re: RFH: using SOUGHT logic to combat phish

Posted by "Kevin A. McGrail" <km...@apache.org>.
On 10/11/2022 5:38 AM, Matus UHLAR - fantomas wrote:
> Are these still working, or are there newer versions?
>
> Does anyone have hints on how to process a phish archive?
>
> I mean, I could presumably weed out any repeated non-phish phrases 
> manually to avoid FPs, or manually check which mail they hit, so I 
> wouldn't need to keep much ham. 
There was some interest in a SOUGHT2, but no, the tooling hasn't been 
looked at in some time.  It could be promising if you want to dig into it!

-- 
Kevin A. McGrail
KMcGrail@Apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


Re: RFH: using SOUGHT logic to combat phish

Posted by Kris Deugau <kd...@vianet.ca>.
Matus UHLAR - fantomas wrote:
> Hello,
> 
> I have a fairly large archive of phishing mail (bank and mail account 
> phishes), in which many words and phrases repeat.
> 
> I was thinking about processing them manually and creating rules, but 
> that would be a lot of work.
> I remember that the SOUGHT ruleset used to contain phrases that 
> appeared repeatedly, so I'd like to use that approach, if possible.
> 
> So far I have found:
> - a description of how it works: https://taint.org/2007/03/05/134447a.html
> - scripts to search a corpus:
>    
> https://svn.apache.org/repos/asf/spamassassin/trunk/masses/rule-dev/seek-phrases-in-corpus 
> 
> 
> which seem to use plugins (Dumptext.pm, GrepRenderedBody.pm) that I 
> found at: 
> https://svn.apache.org/repos/asf/spamassassin/branches/3.3/masses/plugins/
> 
> 
> Are these still working, or are there newer versions?

I'm a little hazy on the deep internals, but all the parts are still in 
SVN trunk.  I've been using this locally with a growing collection of 
configuration wrappers to generate a number of rule sets for different 
subgroups of spam.

I've just tried a test in a current trunk checkout and everything seems 
to work without issue.  Some components may need a little more tweaking 
for local conditions.

> Does anyone have hints on how to process a phish archive?
> 
> I mean, I could presumably weed out any repeated non-phish phrases 
> manually to avoid FPs, or manually check which mail they hit, so I 
> wouldn't need to keep much ham.

The minimal setup is to modify 
masses/rule-dev/sought/example_backend/run for your local pathnames, 
change the rule fragment names however you like, and run that script. 
I've attached a patch showing my own changes for my quick test above.

You *do* need a collection of ham, however; as-is, the tooling relies on 
it both to weed out patterns you don't want firing and to sort/group the 
patterns by hit-rate thresholds.
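
A rough sketch of what that ham filtering and hit-rate grouping amount to 
(a toy simplification of mine, not the real tooling; the data, names and 
thresholds are all made up):

```python
# Sketch of the ham-filtering step: drop any candidate phrase that
# also appears in ham (FP avoidance), then bucket the survivors by
# how many spam messages they hit.  Thresholds are illustrative.
def filter_and_bucket(candidates, spam, ham, buckets=(10, 3)):
    # Discard anything that fires on ham at all.
    safe = [p for p in candidates if not any(p in h.lower() for h in ham)]
    # Group the rest by spam hit count, highest threshold first.
    grouped = {t: [] for t in buckets}
    for phrase in safe:
        hits = sum(phrase in s.lower() for s in spam)
        for threshold in buckets:
            if hits >= threshold:
                grouped[threshold].append(phrase)
                break
    return grouped

candidates = ["verify your account immediately", "unsubscribe from this list"]
spam = ["please verify your account immediately"] * 4
ham = ["click unsubscribe from this list to stop these emails"]
print(filter_and_bucket(candidates, spam, ham))
# -> {10: [], 3: ['verify your account immediately']}
```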

You could probably still use one of the intermediate files to bootstrap 
what you might otherwise have done manually, but you risk including poor 
patterns (either ones that rarely hit, or ones that also hit ham).

-kgd