Posted to users@spamassassin.apache.org by Paul Stead <pa...@zeninternet.co.uk> on 2016/05/24 14:55:50 UTC

SA Concepts - plugin for email semantics

Hi guys,

Based upon some information from others on the list I have put together
a plugin for SA which canonicalises an email into its basic "concepts".
Concepts are converted to tags, which Bayes can use as tokens to further
help identify spammy/hammy characteristics.

Here are some examples of tags from some emails today -

---8<---
X-SA-Concepts: experience regards money optout time-ref dear great home
request member enjoy woman-adj important online click all-rights
email-adr please price best hot-adj
X-SA-Concepts: experience contact optout winner time-ref survey dear
home privacy prize store thankyou important click gift chance please
X-SA-Concepts: google law search-eng optout amazing order facebook
goodtime privacy lotsofmoney request enjoy details service partner
linkedin twitter trust contact time-ref great online click shop
email-adr please customer newsletter news
X-SA-Concepts: photos view-online money contact optout time-ref cost
reply2me service details online click please
X-SA-Concepts: friend hotwords trust experience regards contact time-ref
medical woman drugs consultant pill mailto woman-adj secret health earn
email-adr please security hot-adj day-of-week
X-SA-Concepts: https mailto re euros regards money youtube invoice
email-adr facebook best hair
---8<---

This plugin essentially adds an extra layer between the raw input
characteristics and recognition types - allowing clustering of different
characteristics into a more generic type - in effect giving Bayes more of
a two-layer neural network approach.

When combined with Bayes learning, these email semantics (or Concepts)
can then be weighed alongside the multiple other characteristics of that
email, and compared to the email that came before it.
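
In rough terms the tagging step looks something like the sketch below -
illustrative only, with invented concept names and patterns (the real
pattern sets live in the concept files in the repo):

---8<---
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sketch: map body text to concept tags via sets of
# patterns. These two concepts are invented for the example.
my %concepts = (
    'meds'   => [ qr/\bv[i1]agra\b/i, qr/\bcialis\b/i ],
    'optout' => [ qr/\bopt[ -]?out\b/i, qr/\bunsubscribe\b/i ],
);

sub concept_tags {
    my ($body) = @_;
    my %hit;
    while (my ($tag, $patterns) = each %concepts) {
        $hit{$tag} = 1 if grep { $body =~ $_ } @$patterns;
    }
    return sort keys %hit;    # each concept is tagged at most once
}

my $body = "Click here to opt out of our V1agra newsletter";
print "X-SA-Concepts: ", join(' ', concept_tags($body)), "\n";
---8<---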

https://github.com/fmbla/spamassassin-concepts

I'd be really interested to hear your feedback/thoughts on this system
and its approach.

Paul
--
Paul Stead
Systems Engineer
Zen Internet

Re: SA Concepts - plugin for email semantics

Posted by Reindl Harald <h....@thelounge.net>.

On 29.05.2016 at 02:46, Dianne Skoll wrote:
> And also, two-word phrases can be stronger indicators than the
> individual words; "hot" and "sex" in isolation may not be strong spam
> indicators, but "hot sex" probably is stronger.
>
> Going from one-word tokens to one+two-word tokens will have a pretty
> big payoff, I think.  I'm not so sure about two to three

+1

the best results for many of the sort of spams which try to defeat bayes
would be 2 or 3 word tokens - we complement bayes with currently 1500
handcrafted body rules with scores of 0.5/1.5/2.5/3.5/4.5 points

the majority of those rules have 2 or 3 words

the current tokens should stay as they are, with *additional* 2-word
tokens for the same messages - that would boost bayes to a completely
different level with enough training data

one word tokens are limited in many ways (though they work reasonably
well, it has to be said)


Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Sat, 28 May 2016 14:53:15 -0700 (PDT)
John Hardin <jh...@impsec.org> wrote:

> Based on that, do you have an opinion on the proposal to add two-word
> (or configurable-length) combinations to Bayes?

I have an opinion. :)

Extending Bayes to look at multiple tokens is a *very* good idea.
That's because naive single-word Bayes assumes that the probability of
a token is independent of the presence of other tokens.  But this is
very rarely the case.  For example, the word "mussel" is substantially
more likely to follow the word "zebra" than it is to follow the word
"xenophobic".  So "zebra mussel" might be a couple of ecologists
talking, while "xenophobic mussel" could well be random text designed
to confuse Bayes.

And also, two-word phrases can be stronger indicators than the
individual words; "hot" and "sex" in isolation may not be strong spam
indicators, but "hot sex" probably is stronger.

Going from one-word tokens to one+two-word tokens will have a pretty
big payoff, I think.  I'm not so sure about two to three.
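
To illustrate, going from unigrams to unigrams-plus-bigrams is a tiny
change to a tokenizer (a sketch, not SpamAssassin's actual tokenizer):

---8<---
# Sketch: emit one-word tokens plus adjacent two-word tokens.
sub tokens {
    my ($text) = @_;
    my @words = split ' ', lc $text;
    my @toks  = @words;                                         # unigrams
    push @toks, "$words[$_] $words[$_+1]" for 0 .. $#words - 1; # bigrams
    return @toks;
}
# tokens("Hot sex pills") -> ("hot", "sex", "pills", "hot sex", "sex pills")
---8<---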

Regards,

Dianne.

Re: SA Concepts - plugin for email semantics

Posted by RW <rw...@googlemail.com>.
On Tue, 31 May 2016 12:05:39 -0400
Bill Cole wrote:

> On 31 May 2016, at 2:21, Henrik K wrote:
> 
> > On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote:  
> >> On Mon, 30 May 2016 17:45:52 -0400
> >> "Bill Cole" <sa...@billmail.scconsult.com> wrote:
> >>  
> >>> So you could have 'sex' and 'meds' and 'watches' tallied up
> >>> into frequency counts that sum up natural (word) and synthetic
> >>> (concept) occurrences, not just as incompatible types of input
> >>> feature but as a conflation of incompatible features.  
> >>
> >> That is easy to patch by giving "concepts" a separate namespace.
> >> You could do that by picking a character that can't be in a normal
> >> token and
> >> using something like:  concept*meds, concept*sex, etc. as tokens.  
> >
> > This is how the put_metadata stuff already works in concepts and
> > other plugins. It sees a "Hx-sa-concepts:foobar" token.  
> 
> That's less bad than the description Paul Stead originally gave,
> which was to add headers with various simple word tags "which Bayes
> can use as tokens." If the actual implementation is doing something
> else in a separate Bayes DB, I don't see a problem with it (although
> I'd expect it to be less accurate than 1-word Bayes)

It's not in a separate database, it's just that words in headers
generate distinct tokens from words in the body. 

Re: SA Concepts - plugin for email semantics

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 31 May 2016, at 2:21, Henrik K wrote:

> On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote:
>> On Mon, 30 May 2016 17:45:52 -0400
>> "Bill Cole" <sa...@billmail.scconsult.com> wrote:
>>
>>> So you could have 'sex' and 'meds' and 'watches' tallied up into
>>> frequency counts that sum up natural (word) and synthetic (concept)
>>> occurrences, not just as incompatible types of input feature but as
>>> a conflation of incompatible features.
>>
>> That is easy to patch by giving "concepts" a separate namespace.  You
>> could do that by picking a character that can't be in a normal token 
>> and
>> using something like:  concept*meds, concept*sex, etc. as tokens.
>
> This is how the put_metadata stuff already works in concepts and other
> plugins. It sees a "Hx-sa-concepts:foobar" token.

That's less bad than the description Paul Stead originally gave, which 
was to add headers with various simple word tags "which Bayes can use as 
tokens." If the actual implementation is doing something else in a 
separate Bayes DB, I don't see a problem with it (although I'd expect it 
to be less accurate than 1-word Bayes)

Re: SA Concepts - plugin for email semantics

Posted by Henrik K <he...@hege.li>.
On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote:
> On Mon, 30 May 2016 17:45:52 -0400
> "Bill Cole" <sa...@billmail.scconsult.com> wrote:
> 
> > So you could have 'sex' and 'meds' and 'watches' tallied up into
> > frequency counts that sum up natural (word) and synthetic (concept)
> > occurrences, not just as incompatible types of input feature but as
> > a conflation of incompatible features.
> 
> That is easy to patch by giving "concepts" a separate namespace.  You
> could do that by picking a character that can't be in a normal token and
> using something like:  concept*meds, concept*sex, etc. as tokens.

This is how the put_metadata stuff already works in concepts and other
plugins. It sees a "Hx-sa-concepts:foobar" token.
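
Roughly, inside a plugin's extract_metadata hook (a simplified sketch;
the header name follows the Concepts plugin, the tag list is invented):

---8<---
sub extract_metadata {
    my ($self, $opts) = @_;
    my @tags = ('meds', 'optout');   # would come from concept matching
    $opts->{msg}->put_metadata('X-SA-Concepts', join(' ', @tags));
    # Bayes then tokenizes the metadata header as e.g.
    # "Hx-sa-concepts:meds" - its own namespace, as described above.
}
---8<---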


Re: SA Concepts - plugin for email semantics

Posted by RW <rw...@googlemail.com>.
On Mon, 30 May 2016 17:45:52 -0400
Bill Cole wrote:


> The "Naive Bayes" classification approach is theoretically moored to 
> Bayes' Theorem 

FWIW Bayes hasn't been "Naive Bayes" for a long time.

Re: SA Concepts - plugin for email semantics

Posted by Reindl Harald <h....@thelounge.net>.

On 31.05.2016 at 02:30, Bill Cole wrote:
> On 30 May 2016, at 18:25, Dianne Skoll wrote:
>
>> On Mon, 30 May 2016 17:45:52 -0400
>> "Bill Cole" <sa...@billmail.scconsult.com> wrote:
>>
>>> So you could have 'sex' and 'meds' and 'watches' tallied up into
>>> frequency counts that sum up natural (word) and synthetic (concept)
>>> occurrences, not just as incompatible types of input feature but as
>>> a conflation of incompatible features.
>>
>> That is easy to patch by giving "concepts" a separate namespace.  You
>> could do that by picking a character that can't be in a normal token and
>> using something like:  concept*meds, concept*sex, etc. as tokens.
>
> Yes, but I'd still be reluctant to have that namespace directly blended
> with 1-word Bayes because those "concepts" are qualitatively different:
> inherently much more complex in their measurement than words. Robotic
> semantic analysis hasn't reached the point where an unremarkable machine
> can decide whether a message is porn or a discussion of current
> political issues, and I would not hazard a guess as to which actual
> concept in email is more likely to be spam or ham these days. Any old
> mail server can of course tell whether the word 'Carolina' is present in
> a message, which probably distributes quite disproportionately towards ham

what's the difference between having two tokens "hot" and "sex" versus
3 tokens "hot", "sex" and "hot sex" for bayes classification?


Re: SA Concepts - plugin for email semantics

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 30 May 2016, at 18:25, Dianne Skoll wrote:

> On Mon, 30 May 2016 17:45:52 -0400
> "Bill Cole" <sa...@billmail.scconsult.com> wrote:
>
>> So you could have 'sex' and 'meds' and 'watches' tallied up into
>> frequency counts that sum up natural (word) and synthetic (concept)
>> occurrences, not just as incompatible types of input feature but as
>> a conflation of incompatible features.
>
> That is easy to patch by giving "concepts" a separate namespace.  You
> could do that by picking a character that can't be in a normal token 
> and
> using something like:  concept*meds, concept*sex, etc. as tokens.

Yes, but I'd still be reluctant to have that namespace directly blended 
with 1-word Bayes because those "concepts" are qualitatively different: 
inherently much more complex in their measurement than words. Robotic 
semantic analysis hasn't reached the point where an unremarkable machine 
can decide whether a message is porn or a discussion of current 
political issues, and I would not hazard a guess as to which actual 
concept in email is more likely to be spam or ham these days. Any old 
mail server can of course tell whether the word 'Carolina' is present in 
a message, which probably distributes quite disproportionately towards 
ham.

>> FWIW, I have roughly no free time for anything between work and
>> family demands but if I did, I would most like to build a blind
>> fixed-length tokenization Bayes classifier: just slice up a message
>> into all of its n-byte sequences (so that a message of bytelength x
>> would have x-(n-1) different tokens) and use those as inputs instead
>> of words.
>
> I think that could be very effective with (as you said) plenty of
> training.  I think there *may* be slight justification for
> canonicalizing text parts into utf-8 first; while you are losing
> information, it's hard to see how 手机色情 should be treated
> differently depending on the character encoding.

Well, I've not thought it through deeply, but an evasion of the charset 
issue might be to just decode any Base64 or QP transfer encoding (which 
can be path-dependent rather than a function of the sender or content) 
to get 8-bit bytes and use 6-byte tokens as if it was all 1-byte chars. 
UCS-4 messages would be a wreck, but pairs of non-ASCII chars in UTF-8 
would be seen cleanly once and as an aura of 10 semi-junk tokens around 
them, in a manner that might effectively wash itself out. Or go to 
12-byte tokens and get the same effect with UCS-4. Or 3-byte tokens: 
screw 32-bit charsets, screw encoding semantics of UTF-8, just have 16.8 
million possible 24-bit tokens and see how they distribute. It seems to 
me that this is almost the ultimate test for Naive Bayes text analysis: 
break away from the idea that the input features have any innate meaning 
at all, let them be pure proxies for whatever complex larger patterns 
give rise to them.

Oh, and did I mention that Bayes' Theorem has different 
"interpretations" in the same way Heisenberg's Uncertainty Principle and 
quantum superposition do? 24-bit tokens could settle the dispute...

Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Mon, 30 May 2016 17:45:52 -0400
"Bill Cole" <sa...@billmail.scconsult.com> wrote:

> So you could have 'sex' and 'meds' and 'watches' tallied up into
> frequency counts that sum up natural (word) and synthetic (concept)
> occurrences, not just as incompatible types of input feature but as
> a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace.  You
could do that by picking a character that can't be in a normal token and
using something like:  concept*meds, concept*sex, etc. as tokens.
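
In code it's a one-liner (sketch):

---8<---
# Sketch: '*' can't occur in a normal word token, so prefixed synthetic
# tokens can never collide with natural words.
my @concepts = ('meds', 'sex');
my @tokens   = map { "concept*$_" } @concepts;  # concept*meds, concept*sex
---8<---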

> FWIW, I have roughly no free time for anything between work and
> family demands but if I did, I would most like to build a blind
> fixed-length tokenization Bayes classifier: just slice up a message
> into all of its n-byte sequences (so that a message of bytelength x
> would have x-(n-1) different tokens) and use those as inputs instead
> of words.

I think that could be very effective with (as you said) plenty of
training.  I think there *may* be slight justification for
canonicalizing text parts into utf-8 first; while you are losing
information, it's hard to see how 手机色情 should be treated
differently depending on the character encoding.

Regards,

Dianne.

Re: SA Concepts - plugin for email semantics

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 28 May 2016, at 17:53, John Hardin wrote:

> Based on that, do you have an opinion on the proposal to add two-word 
> (or configurable-length) combinations to Bayes?

CAVEAT: it has literally been decades since I've worked deep in 
statistics on a routine basis rather than just using blindly trusted 
black-box tools every now and then, so some of the below could be 
influenced by senile dementia...

Tallying word pairs *instead* of single words or as a second discrete 
Bayes analysis wouldn't be a problem and would surely be useful, 
possibly more useful than single-word analysis.

Doing one unified analysis where single words and multi-word phrases are 
both tallied in one Bayes DB to determine one Bayes score is less 
clearly valid because there is absolute dependence in one direction: the 
presence of any phrase requires its component words also to be present. 
OTOH, whether sets of words that are commonly used in particular 
sequences occur independently with or without matching those sequences 
is pretty clearly an independent feature of a text not captured by 
1-word tokenization, so it wouldn't be blatantly wrong to capture it 
indirectly by having a unified word and phrase Bayes DB. So I guess I'm 
undecided, leaning in favor because it captures information otherwise 
invisible to the Bayes DB.

The "Naive Bayes" classification approach is theoretically moored to 
Bayes' Theorem by the concept that even if there's SOME dependent 
correlation across the features being measured to feed the 
classification database, incomplete dependency makes a large set of 
similar measurable features (like the presence of words in a message) 
usable as a proxy for a hypothetical set of truly independent features 
which are unknown and may not be readily quantified. For textual 
analysis, this ironically might be "concepts" but to be accurate that 
set would have to include a properly distributed sample of all possible 
concepts and a concrete way to detect each one accurately. Using words 
or n-word phrases instead of concepts means that Bayesian spam 
classification does not require a full-resolution simulation of Brahman 
on every mail server. Those are very resource-heavy...

The canonical empirical example of Naive Bayes classification is the use 
of simple physical body measurements to classify humans by biological 
sex. That classification improves as one adds more direct physical 
measurements, even though they all relate to each other via abstract 
ideas like "size," "muscularity," and "shape". However, if one includes 
such subjective abstractions, accuracy usually suffers (unless you cheat 
with features like 'femininity'.) Less intuitively, if one adds 
arbitrary derived features like BMI which can be calculated from the 
simpler measured features also in the input set, classification accuracy 
also is usually made worse. Perversely, classifiers using purely 
subjective abstractions or purely derived values such as various ratios 
of direct physical metrics work better on average than classifiers of 
mixed types, but can work better or worse than classifiers using the 
simple measurements on which the derived features are based. This is 
where the serious arguments about various Naive Bayes implementations 
arise: What constitutes features of compatible classes? How strong can a 
correlation between features be without effectively being measurements 
of the same thing twice? Is the empirical support for the idea of 
semi-independent features as proxies for truly independent features 
strong enough? Are the distributions of the predictive features and the 
classifications compatible with each other, or with Bayes
*AT ALL*?

The approach of mixing "concepts" into the existing Bayes DB is 
qualitatively broken because concept tokens would be deterministically 
derived from the actual word tokens in messages based on some subjective 
scheme and then added as words which are likely to also be naturally
occurring in some but not all of the messages to which they are added.
So you could have 'sex' and 'meds' and 'watches' tallied up into
frequency counts that sum up natural (word) and synthetic (concept) 
occurrences, not just as incompatible types of input feature but as a 
conflation of incompatible features.


FWIW, I have roughly no free time for anything between work and family 
demands but if I did, I would most like to build a blind fixed-length 
tokenization Bayes classifier: just slice up a message into all of its 
n-byte sequences (so that a message of bytelength x would have x-(n-1) 
different tokens) and use those as inputs instead of words. An advantage 
to this over word-wise Bayes would be attenuation of semantic 
entanglement and better detection of intentional obfuscation, at the 
cost of needing huge training volume to get a usable classifier.
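
A sketch of that tokenizer (illustrative; n would be tuned, per the
discussion of 3/6/12-byte tokens elsewhere in this thread):

---8<---
# Sketch: blind fixed-length tokenization - all n-byte substrings.
# A message of byte length x yields x-(n-1) tokens, as described above.
sub byte_tokens {
    my ($msg, $n) = @_;
    my @toks;
    push @toks, substr($msg, $_, $n) for 0 .. length($msg) - $n;
    return @toks;
}
# byte_tokens("viagra", 3) -> ("via", "iag", "agr", "gra")
---8<---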

Re: SA Concepts - plugin for email semantics

Posted by John Hardin <jh...@impsec.org>.
On Sat, 28 May 2016, Bill Cole wrote:

> There is sound statistical theory consistent with empirical evidence 
> underpinning the Bayes classifier implementation in SA. While there can be 
> legitimate critiques of the SA implementation specifically and in general how 
> well email word frequency fits Bayes' Theorem, injecting a pile of new 
> derivative meta-tokens based on pre-conceived notions of "concepts" into the 
> Bayesian analysis invalidates the assumption of what the input for Naive 
> Bayes analysis is: *independent* features. The "concepts" approach adds words 
> that are *dependent* on the presence of other words in the document and to 
> make it worse, those dependent words may already exist in some pristine 
> messages. It unmoors the SA Bayes implementation from any theoretical 
> grounding, converting its complex math from statistical analysis into 
> arbitrary numerology.

Based on that, do you have an opinion on the proposal to add two-word (or 
configurable-length) combinations to Bayes?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Gun Control is marketed to the public using the appealing delusion
   that violent criminals will obey the law.
-----------------------------------------------------------------------
  2 days until Memorial Day - honor those who sacrificed for our liberty

Re: SA Concepts - plugin for email semantics

Posted by David Jones <dj...@ena.com>.
>From: RW <rw...@googlemail.com>
>Sent: Tuesday, May 31, 2016 5:20 PM
>To: users@spamassassin.apache.org
>Subject: Re: SA Concepts - plugin for email semantics

>On Tue, 31 May 2016 15:20:56 -0400
>Bill Cole wrote:

>> On 29 May 2016, at 11:07, RW wrote:
>>

>> > Statistical filters are based on some statistical theory combined
>> > with pragmatic kludges and assumptions. Practical filters have been
>> > developed based on what's been found to work, not on what's more
>> > statistically correct.
>>
>> I'm not aware of any hard evidence that the SA Bayes pragmatic
>> kludges and assumptions perform better or worse than an
>> implementation that used fewer or different ones.

>It's not specific to SA; for example, there's no sound basis for
>assigning token probability to tokens that have zero ham or spam
>counts, so many classifications turn on completely made-up probabilities.
>There's also no way of assigning meaningful probabilities to tokens
>that enter or re-enter the database while it's mature without making
>an assumption about the current spam/ham training ratio.

>The assumption that tokens are independent was never reasonable in the
>first place; there's plenty of natural duplication, e.g. ip address and
>RDNS, and strong correlations between important tokens. There's also a
>lot of inadvertent duplication, for example from metadata headers that
>are not primarily intended for Bayes.


>I don't think concepts is a particularly good idea, but I don't like to
>see someone's work dismissed on such paper-thin theoretical grounds.



>> > I think the OP is probably underselling it, in that it could be
>> > used to
>> > extract information that normal tokenization can't get, for example:
>> > ...
>> > The main problem is that you'd need a lot of rules to make a
>> > substantial
>> > difference.
>>
>> So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
>> do half-assed dynamic score adjustment per site with rules that will
>> either evolve constantly or grow stale?

>I was thinking that it would be an alternative to local custom rules
>- particularly for spams that leave Bayes with little to work with and
> where individual body rules aren't worth much of a score.

I think it could be valuable in custom meta rules.  That's how I would
like to try it out anyway for a while with minuscule scores.

Dave

Re: SA Concepts - plugin for email semantics

Posted by RW <rw...@googlemail.com>.
On Tue, 31 May 2016 15:20:56 -0400
Bill Cole wrote:

> On 29 May 2016, at 11:07, RW wrote:
> 

> > Statistical filters are based on some statistical theory combined
> > with pragmatic kludges and assumptions. Practical filters have been
> > developed based on what's been found to work, not on what's more
> > statistically correct.  
> 
> I'm not aware of any hard evidence that the SA Bayes pragmatic
> kludges and assumptions perform better or worse than an
> implementation that used fewer or different ones. 

It's not specific to SA; for example, there's no sound basis for
assigning token probability to tokens that have zero ham or spam
counts, so many classifications turn on completely made-up probabilities.
There's also no way of assigning meaningful probabilities to tokens
that enter or re-enter the database while it's mature without making
an assumption about the current spam/ham training ratio.
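
To make that concrete, Robinson-style smoothing (roughly what SA-family
Bayes filters use) degrades to a pure assumption at zero counts - a
sketch:

---8<---
# Sketch of Robinson-style token probability,
#   f(w) = (s*x + n*p(w)) / (s + n)
# where n = times the token has been trained, p(w) = raw spam
# probability from the counts, x = assumed probability for unseen
# tokens, s = strength given to that assumption.
sub token_prob {
    my ($spam, $ham, $nspam, $nham, $s, $x) = @_;
    my $n = $spam + $ham;
    return $x if $n == 0;          # zero counts: entirely made up
    my $ps = $spam / $nspam;       # assumes nonzero corpus totals
    my $ph = $ham  / $nham;
    my $pw = $ps / ($ps + $ph);
    return ($s * $x + $n * $pw) / ($s + $n);
}
---8<---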

The assumption that tokens are independent was never reasonable in the
first place; there's plenty of natural duplication, e.g. ip address and
RDNS, and strong correlations between important tokens. There's also a
lot of inadvertent duplication, for example from metadata headers that
are not primarily intended for Bayes.


I don't think concepts is a particularly good idea, but I don't like to
see someone's work dismissed on such paper-thin theoretical grounds.



> > I think the OP is probably underselling it, in that it could be
> > used to
> > extract information that normal tokenization can't get, for example:
> > ...
> > The main problem is that you'd need a lot of rules to make a 
> > substantial
> > difference.  
> 
> So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
> do half-assed dynamic score adjustment per site with rules that will
> either evolve constantly or grow stale?

I was thinking that it would be an alternative to local custom rules
- particularly for spams that leave Bayes with little to work with and
 where individual body rules aren't worth much of a score. 

Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Tue, 31 May 2016 21:23:11 +0100
Paul Stead <pa...@zeninternet.co.uk> wrote:

> The implementation was undertaken from a personal interest - I asked
> the question of what people thought of the implementation and the
> impact on the Bayes DB.

I think what the "concepts" concept ends up doing is this: "concepts"
are more-or-less equivalent to SpamAssassin heuristics.  And instead
of using SpamAssassin's genetic algorithm to figure out the most
appropriate scores for each concept, you're letting Bayes tell you the
strength of each concept.

I do seem to recall a proposal to throw SpamAssassin test names into
Bayes, which would be interesting... it would be fascinating to see
the amount of agreement or divergence between the rules Bayes thinks
are important and the rules the genetic algorithm weights heavily.

Regards,

Dianne.

Re: SA Concepts - plugin for email semantics

Posted by Paul Stead <pa...@zeninternet.co.uk>.

On 31/05/16 20:20, Bill Cole wrote:
> It is no shock that while this implementation has Paul Stead's name on
> it, it is apparently mostly the product of the anti-spam community's
> most spectacular case of Dunning-Kruger Syndrome, who has apparently
> figured out that his personal 'brand' has negative value.

The implementation was undertaken from a personal interest - I asked the
question of what people thought of the implementation and the impact on
the Bayes DB.
Thank you everyone for the feedback - I certainly didn't expect quite as
much!

I know this isn't an attack aimed specifically at me - but at least this
got the conversational juices flowing and more ideas hammered out?

> The craziest part of this is that we already HAVE this functionality
> outside of the SA Bayes filter. It's called SpamAssassin. Perkel's
> concept files in Stead's plugin could be robotically translated into
> sub-rules and meta-rules, run through the normal Rules QA mechanism,
> and dynamically scored. There is no reason to hide this stuff behind
> Bayes where it would be mixing a jumble of derivative meta-tokens into
> a database of case-normalized primitive tokens, amplifying an
> arbitrary subset of the information already present in the Bayes DB.

I think a bit of time in the dev talk might be needed for me before I go
ahead with a "concept" - I didn't intend to come across as selling this
as the next FUSSP or anything - and I agreed with the initial responses
of the list, hence not carrying on the conversation further.

Paul
--
Paul Stead
Systems Engineer
Zen Internet

Re: SA Concepts - plugin for email semantics

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 29 May 2016, at 11:07, RW wrote:

> On Sat, 28 May 2016 15:37:21 -0400
> Bill Cole wrote:
>
>
>> More importantly (IMHO) they aren't designed to collide with existing
>> common tokens and be added back into messages that may contain those
>> tokens already in order to influence Bayesian classification.
>>
>> There is sound statistical theory consistent with empirical evidence
>> underpinning the Bayes classifier implementation in SA. While there
>> can be legitimate critiques of the SA implementation specifically and
>> in general how well email word frequency fits Bayes' Theorem,
>> injecting a pile of new derivative meta-tokens based on pre-conceived
>> notions of "concepts" into the Bayesian analysis invalidates the
>> assumption of what the input for Naive Bayes analysis is:
>> *independent* features. The "concepts" approach adds words that are
>> *dependent* on the presence of other words in the document and to
>> make it worse, those dependent words may already exist in some
>> pristine messages. It unmoors the SA Bayes implementation from any
>> theoretical grounding, converting its complex math from statistical
>> analysis into arbitrary numerology.
>
> Statistical filters are based on some statistical theory combined with
> pragmatic kludges and assumptions. Practical filters have been
> developed based on what's been found to work, not on what's more
> statistically correct.

I'm not aware of any hard evidence that the SA Bayes pragmatic kludges 
and assumptions perform better or worse than an implementation that used 
fewer or different ones. I confess that I have not actually *LOOKED* for 
such evidence in the past 6 years, so maybe you are aware of something I 
never could find simply because it didn't yet exist.

> Bayes already creates multiple tokens from the same information, most
> notably case-sensitive and lower-case words in the body.
>
> I don't see a huge difference between
>
>   "Bill Cole" tokenizing as {Bill, bill, Cole, cole}
>
> and
>
>   "v1agra, ciali5"  tokenizing as {v1agra, ciali5, eddrug}

De-capitalizing and full case-squashing have their own issues 
(particularly when one's lower-cased first name has 5 noun and 2 verb 
definitions...) but it is an invariant deterministic process for the 
most popular (so far) spamming languages. Today's de-capitalizing 
tokenization of English words is going to yield the same tokens today, 
tomorrow, and 3 years from now. There is a strong argument for 
de-capitalization in English because of our capitalization rules: 'Bill' 
could properly be a name or any of 7 noun & verb meanings as the first 
word of a sentence, while 'bill' is properly only one of those 7  
meanings NOT as the first word. Adding a de-capitalized token captures 
all of the occurrences of the most common (for most people) uses of the 
word in one token tally, making it more complete. There is no "throwing 
away information" in that process and also no invention of new 
meta-information in a way that might get updated in later analyses.

A Naive Bayes purist might insist that expanding capitalized forms into 
2 tokens means that you're double-counting one token and so 
overweighting capitalized words: words often used to start sentences and 
words that are sometimes proper nouns. I think that argument would be 
MUCH stronger in German, where variant capitalization would be a strong 
style signal.

My guess is that de-capitalization for English likely yields better 
results than if only the pristine words were used. I also think it could 
be useful to have a visual normalization algorithm that would turn 
"V1agra" into {V1agra, viagra} and "Unwi11!ng" into {Unwi11!ng, 
unwi11ing}. These are *guesses* on my part, but I think that can be 
rationalized by understanding that capitalization and intentional 
obfuscation that maintains the visual appearance of a word are 
effectively noise interfering with the words that the author intended 
the reader to see, so while it is important to retain capitalized or 
obfuscated forms for the information wrapped up in that formal 
difference, it is also correct to count them as the words they are 
intended to be.
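
A sketch of what I mean (the glyph map is a guess - '1' is ambiguous
between 'i' and 'l', which is exactly why this can only ever be a
heuristic):

---8<---
# Sketch: emit the pristine form plus a visually-normalized guess.
sub visual_forms {
    my ($word) = @_;
    (my $norm = lc $word) =~ tr/0134578!/oieastbi/;  # leet -> letters
    return $word eq $norm ? ($word) : ($word, $norm);
}
# visual_forms("V1agra") -> ("V1agra", "viagra")
---8<---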

Concepts are fundamentally different because there is no finite set of 
all concepts, no generally-accepted (or even suggested, as far as I 
know...) finite set of all commonly used concepts, no formal definition 
of how to divide broad topical areas into discrete concepts, no nothing 
but the vague fuzzy concept of concepts. The currently-offered 
implementation has 250 concept files, each consisting of arbitrary 
subjective pattern sets ranging from a single pattern matching a line 76 
characters or longer followed by a 76-character line (how is that a 
"concept???") to a woefully incomplete list of Apple brands. It seems 
unavoidable that the number of concepts will grow and the definition of 
existing "concepts" will change at an ongoing rapid pace such that 
hitting the '76char' concept(ugh) may be forever reliable, hitting the 
'asian' concept is surely going to need to be MUCH easier in the future 
than it currently is. It is no shock that while this implementation has 
Paul Stead's name on it, it is apparently mostly the product of the 
anti-spam community's most spectacular case of Dunning-Kruger Syndrome, 
who has apparently figured out that his personal 'brand' has negative 
value.

> The only way to find out whether it works is to try it.

Sure, but with innovations for the SA Bayes filter that seem to me to be 
in profound conflict with the theory of why the SA Bayes filter DOES 
work, I'm going to let others generate that evidence one way or the 
other.

The craziest part of this is that we already HAVE this functionality 
outside of the SA Bayes filter. It's called SpamAssassin. Perkel's 
concept files in Stead's plugin could be robotically translated into 
sub-rules and meta-rules, run through the normal Rules QA mechanism, and 
dynamically scored. There is no reason to hide this stuff behind Bayes 
where it would be mixing a jumble of derivative meta-tokens into a 
database of case-normalized primitive tokens, amplifying an arbitrary 
subset of the information already present in the Bayes DB.
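
To be concrete about that robotic translation (rule names, patterns, and
the score here are all invented for illustration):

---8<---
#!/usr/bin/perl
# Sketch: turn one concept's pattern list into an SA sub-rule plus a
# scored meta-rule, ready for the normal Rules QA mechanism.
use strict;
use warnings;

my %concept = ( name => 'meds',
                patterns => [ '\bv[i1]agra\b', '\bcialis\b' ] );

my $re   = join '|', @{ $concept{patterns} };
my $sub  = uc "__CONCEPT_$concept{name}";
my $rule = uc "CONCEPT_$concept{name}";
print <<"EOF";
body     $sub  /$re/i
meta     $rule  ($sub)
describe $rule  Message matches the '$concept{name}' concept
score    $rule  0.5
EOF
---8<---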

> I think the OP is probably underselling it, in that it could be used 
> to
> extract information that normal tokenization can't get, for example:
>
> /%.off/i
>
> /Symbol:/i,   /Date:/i,    /Price:/i ...
>
> /^Barrister/i
>
>
>
> The main problem is that you'd need a lot of rules to make a 
> substantial
> difference.

So: re-invent SpamAssassin v1 but without rule scores, using Bayes to do 
half-assed dynamic score adjustment per site with rules that will either
evolve constantly or grow stale?

Let me know how that goes...

Re: SA Concepts - plugin for email semantics

Posted by RW <rw...@googlemail.com>.
On Sat, 28 May 2016 15:37:21 -0400
Bill Cole wrote:


> More importantly (IMHO) they aren't designed to collide with existing 
> common tokens and be added back into messages that may contain those 
> tokens already in order to influence Bayesian classification.
> 
> There is sound statistical theory consistent with empirical evidence 
> underpinning the Bayes classifier implementation in SA. While there
> can be legitimate critiques of the SA implementation specifically and
> in general how well email word frequency fits Bayes' Theorem,
> injecting a pile of new derivative meta-tokens based on pre-conceived
> notions of "concepts" into the Bayesian analysis invalidates the
> assumption of what the input for Naive Bayes analysis is:
> *independent* features. The "concepts" approach adds words that are
> *dependent* on the presence of other words in the document and to
> make it worse, those dependent words may already exist in some
> pristine messages. It unmoors the SA Bayes implementation from any
> theoretical grounding, converting its complex math from statistical
> analysis into arbitrary numerology.

Statistical filters are based on some statistical theory combined with
pragmatic kludges and assumptions. Practical filters have been
developed based on what's been found to work, not on what's more
statistically correct.

Bayes already creates multiple tokens from the same information, most
notably case-sensitive and lower-case words in the body.

I don't see a huge difference between

  "Bill Cole" tokenizing as {Bill, bill, Cole, cole}

and

  "v1agra, ciali5"  tokenizing as {v1agra, ciali5, eddrug}

The only way to find out whether it works is to try it.

I think the OP is probably underselling it, in that it could be used to
extract information that normal tokenization can't get, for example:

/%.off/i

/Symbol:/i,   /Date:/i,    /Price:/i ...

/^Barrister/i



The main problem is that you'd need a lot of rules to make a substantial
difference.

Re: SA Concepts - plugin for email semantics

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 25 May 2016, at 13:15, Dianne Skoll wrote:

> On Wed, 25 May 2016 18:10:57 +0100
> Paul Stead <pa...@zeninternet.co.uk> wrote:
>
[quoting Dianne]
>>> "Concepts" is a lossy process.  You are throwing away information.
>> That is by design, similar to fingerprinting emails with iXhash or
>> Razor.
>
> iXhash and Razor are designed to detect mass-mailings of identical or
> very similar messages; they measure "bulk-ness" and not hammy-ness or
> spammy-ness directly.

More importantly (IMHO) they aren't designed to collide with existing 
common tokens and be added back into messages that may contain those 
tokens already in order to influence Bayesian classification.

There is sound statistical theory consistent with empirical evidence 
underpinning the Bayes classifier implementation in SA. While there can 
be legitimate critiques of the SA implementation specifically and in 
general how well email word frequency fits Bayes' Theorem, injecting a 
pile of new derivative meta-tokens based on pre-conceived notions of 
"concepts" into the Bayesian analysis invalidates the assumption of what 
the input for Naive Bayes analysis is: *independent* features. The 
"concepts" approach adds words that are *dependent* on the presence of 
other words in the document and to make it worse, those dependent words 
may already exist in some pristine messages. It unmoors the SA Bayes 
implementation from any theoretical grounding, converting its complex 
math from statistical analysis into arbitrary numerology.

Re: SA Concepts - plugin for email semantics

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Thu, 26 May 2016 12:20:35 +0200
>Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
>> you apparently mistook razor for DCC; the DCC is there to measure
>> bulkiness, but not (necessarily) spamminess.

On 26.05.16 09:46, Dianne Skoll wrote:
>Yes, you are correct.  Thanks for the clarification!
>
>And also, just to clarify another thing: Lossy procedures are not
>always bad.  SpamAssassin rules, for example, are by definition lossy
>but are quite effective.  However, in my experience, manipulating the
>data you feed into Bayes is rarely helpful and sometimes
>counterproductive, because Bayes on its own is very good at picking
>out what's important and what is not.

especially when it wipes out the differences between spam and ham :-)
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Emacs is a complicated operating system without good text editor.

Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Thu, 26 May 2016 12:20:35 +0200
Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:

> you apparently mistook razor for DCC; the DCC is there to measure
> bulkiness, but not (necessarily) spamminess.

Yes, you are correct.  Thanks for the clarification!

And also, just to clarify another thing: Lossy procedures are not
always bad.  SpamAssassin rules, for example, are by definition lossy
but are quite effective.  However, in my experience, manipulating the
data you feed into Bayes is rarely helpful and sometimes
counterproductive, because Bayes on its own is very good at picking
out what's important and what is not.

Regards,

Dianne.


Re: SA Concepts - plugin for email semantics

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> > Yes, except here's the problem.  A drug company might legitimately
>> > talk about Viagra, so that wouldn't be a spam token.  V1agra almost
>> > certainly would be a spam token.  Bayes can distinguish between the
>> > two; "concepts" cannot.

>On Wed, 25 May 2016 18:10:57 +0100
>Paul Stead <pa...@zeninternet.co.uk> wrote:
>> Bayes cannot make a relationship between V1agra and Viagra.

well, I see no reason why it should. In fact, I think it should not, so
"V1agra" would be handled separately from "Viagra".

>> > "Concepts" is a lossy process.  You are throwing away information.
>> That is by design, similar to fingerprinting emails with iXhash or
>> Razor.

On 25.05.16 13:15, Dianne Skoll wrote:
>iXhash and Razor are designed to detect mass-mailings of identical or
>very similar messages; they measure "bulk-ness" and not hammy-ness or
>spammy-ness directly.

Pardon me, but while I don't know ixhash, the razor DOES measure spamminess
- it's designed to receive (manual) spam reports.

you apparently mistook razor for DCC; the DCC is there to measure bulkiness,
but not (necessarily) spamminess.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.

Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Wed, 25 May 2016 18:10:57 +0100
Paul Stead <pa...@zeninternet.co.uk> wrote:

> > Yes, except here's the problem.  A drug company might legitimately
> > talk about Viagra, so that wouldn't be a spam token.  V1agra almost
> > certainly would be a spam token.  Bayes can distinguish between the
> > two; "concepts" cannot.

> Bayes cannot make a relationship between V1agra and Viagra.

That's correct.  But does that have a detrimental effect on accuracy?
I bet it doesn't if you have a large enough corpus.

> > "Concepts" is a lossy process.  You are throwing away information.
> That is by design, similar to fingerprinting emails with iXhash or
> Razor.

iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.

[...]

> I agree this is becoming more of a problem - homoglyphs are another
> plug-in I'm also investigating...

Yes, now that could be really useful.

Regards,

Dianne.

Re: SA Concepts - plugin for email semantics

Posted by Paul Stead <pa...@zeninternet.co.uk>.

On 25/05/16 15:21, Dianne Skoll wrote:
> On Wed, 25 May 2016 15:07:37 +0100
> Paul Stead <pa...@zeninternet.co.uk> wrote:
>
>> Consider the following 2 basic emails:
>> Mail 1:
>> Viagra
>> Mail 2:
>> V1agra
> Yes, except here's the problem.  A drug company might legitimately
> talk about Viagra, so that wouldn't be a spam token.  V1agra almost
> certainly would be a spam token.  Bayes can distinguish between the
> two; "concepts" cannot.

Bayes cannot make a relationship between V1agra and Viagra. Without
Concepts the two emails have no relationship, so nothing can be weighed
about Mail 2 based on Mail 1.
Of course real email has other tokens we can base this relationship off
- either hammy or spammy - so a drugs company email will have other
positive traits in Bayes that the spam mail doesn't have.

> "Concepts" is a lossy process.  You are throwing away information.
That is by design, similar to fingerprinting emails with iXhash or
Razor. I guess the danger is making the digest too standard? A token of
50/50 isn't much use.
> Furthermore, "concepts" is playing a game of whack-a-mole as spammers
> come up with more creative misspellings and other variations on
> evading the concept detector.
That I cannot argue with, though the lack of a concept could be helpful?

eg Spam email has concepts of "meds", "pharmacy" and "dearstranger" -
the three appearing together is suspect (not forgetting other Bayes tokens)
Legit email has concepts of just "meds" and "pharmacy" - these two are
not suspicious alone (not forgetting other Bayes tokens)

This can be applied to the contrary as well. I'm thinking it's not about
the single concepts that are hit, but what other Bayes tokens (classic
and other Concepts) the email also hits.

>    Do you really want to spend your days
> writing rules to detect:
>
> vіÅᏀʀâ, Ꮩɩɑɢᚱà, etc. and all the exponentially-numerous possible combinations?
I agree this is becoming more of a problem - homoglyphs are another
plug-in I'm also investigating, beyond ReplaceTags.

Paul
--
Paul Stead
Systems Engineer
Zen Internet

Re: SA Concepts - plugin for email semantics

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Wed, 25 May 2016 15:07:37 +0100
Paul Stead <pa...@zeninternet.co.uk> wrote:

> Consider the following 2 basic emails:

> Mail 1:
> Viagra

> Mail 2:
> V1agra

Yes, except here's the problem.  A drug company might legitimately
talk about Viagra, so that wouldn't be a spam token.  V1agra almost
certainly would be a spam token.  Bayes can distinguish between the
two; "concepts" cannot.

"Concepts" is a lossy process.  You are throwing away information.  It
probably helps a bit in small installations where there's not much
Bayes data to go on, but if you have a very large Bayes corpus, I bet
it's no better than Bayes and possibly even worse.

Furthermore, "concepts" is playing a game of whack-a-mole as spammers
come up with more creative misspellings and other variations on
evading the concept detector.  Do you really want to spend your days
writing rules to detect:

vіÅᏀʀâ, Ꮩɩɑɢᚱà, etc. and all the exponentially-numerous possible combinations?

(Credits to http://www.irongeek.com/homoglyph-attack-generator.php)

Regards,

Dianne.

Re: SA Concepts - plugin for email semantics

Posted by Merijn van den Kroonenberg <me...@web2all.nl>.
> It may come down to my understanding of Bayes and its tokens... Also
> having a bit of a problem explaining this concept on paper...
>
> I see this as adding an extra layer to the Bayes:
>
> Consider the following 2 basic emails:
>
> Mail 1:
> Viagra
>
> Mail 2:
> V1agra
>
>
> With Bayes:
>
> Mail 1:
> <token 1>
>
> Mail 2:
> <token 2>
>
> With Concepts & Bayes:
>
> Mail 1:
> <token 1>
> <meds>
>
> Mail 2:
> <token 2>
> <meds>
>
> ---
>
> So without Concepts:
>
> Mail 1 comes into the platform, is tokenized (token1) and is classified
> and learnt as spam.
> Mail 2 comes into the platform, is tokenized (token2) and has no common
> tokens with mail 1 - so no association is made

Why is mail 1 classified and learned as spam and mail 2 not?
Classification and learning are two separate matters. Learning can be done
automatically (based on other rules) or manually.

In your example both Viagra and V1agra could be auto learned as spam based
on other rules. And also by hand. There is basically no difference between
them.

>
> With Concepts
>
> Mail 1 comes into the platform, is tokenized (token1 & meds) and is
> classified and learnt as spam.
> Mail 2 comes into the platform, is tokenized (token2 & meds) and has the
> same common "meds" token as associated with Mail 1
>
> Does this make sense - am I right in my assumptions?

I think it might actually complicate matters for bayes. Now you introduced
a 3rd token, 'meds'. This token is now also considered when bayes decides
if it's ham or spam. However, meds is not a unique token; it already
exists (because I can write a mail asking about my dad's meds, or other
spam might mention it).

So how does this affect all mail with 'meds' in it? Does it classify more
ham? But maybe also more false positives? Or maybe the opposite, because
meds is used in a lot of legit mail?

I think it's very difficult to tamper with this by introducing new content.

Then there is the effort needed to maintain your concepts (I assume all
associations are made by hand, I didn't look at the code yet). It is most
likely always outdated. The original bayes filter would know about V1agra
before you did and added it to your concepts. And once bayes knows it
already, what is the point of creating a concept of it for bayes?

It would be interesting to see what a new bayes db would do which is ONLY
trained with your concepts keywords. This would be a very small bayes db, I
guess. Curious if it could be effective in any way.

>
> Paul
>
> On 25/05/16 09:02, Merijn van den Kroonenberg wrote:
>>> With David's help I have tracked down the problem(s). Version 0.02 is
>>> up. Would be interested to hear your thoughts - even if just theoretical
>>> - about the effect on the Bayes DB.
>> Just in theory, I am curious what part of the Bayes filter you hope to
>> improve? I think you are not adding any *new* information to the e-mail;
>> your concepts are based purely on the mail content, right?
>>
>> It seems you just overpower some tokens a bit more but I am not sure if
>> your concepts are useful for a bayes filter. Especially a bayes filter
>> would not need this I would say. Maybe the concepts would be useful to
>> humans or rules written by humans.
>>
>>> Paul
>>> --
>>> Paul Stead
>>> Systems Engineer
>>> Zen Internet
>>>
>>
>
> --
> Paul Stead
> Systems Engineer
> Zen Internet
>



Re: SA Concepts - plugin for email semantics

Posted by Paul Stead <pa...@zeninternet.co.uk>.
It may come down to my understanding of Bayes and its tokens... Also
having a bit of a problem explaining this concept on paper...

I see this as adding an extra layer to the Bayes:

Consider the following 2 basic emails:

Mail 1:
Viagra

Mail 2:
V1agra


With Bayes:

Mail 1:
<token 1>

Mail 2:
<token 2>

With Concepts & Bayes:

Mail 1:
<token 1>
<meds>

Mail 2:
<token 2>
<meds>

---

So without Concepts:

Mail 1 comes into the platform, is tokenized (token1) and is classified
and learnt as spam.
Mail 2 comes into the platform, is tokenized (token2) and has no common
tokens with mail 1 - so no association is made

With Concepts

Mail 1 comes into the platform, is tokenized (token1 & meds) and is
classified and learnt as spam.
Mail 2 comes into the platform, is tokenized (token2 & meds) and has the
same common "meds" token as associated with Mail 1

Does this make sense - am I right in my assumptions?

Paul

On 25/05/16 09:02, Merijn van den Kroonenberg wrote:
>> With David's help I have tracked down the problem(s). Version 0.02 is
>> up. Would be interested to hear your thoughts - even if just theoretical
>> - about the effect on the Bayes DB.
> Just in theory, I am curious what part of the Bayes filter you hope to
> improve? I think you are not adding any *new* information to the e-mail;
> your concepts are based purely on the mail content, right?
>
> It seems you just overpower some tokens a bit more but I am not sure if
> your concepts are useful for a bayes filter. Especially a bayes filter
> would not need this I would say. Maybe the concepts would be useful to
> humans or rules written by humans.
>
>> Paul
>> --
>> Paul Stead
>> Systems Engineer
>> Zen Internet
>>
>

--
Paul Stead
Systems Engineer
Zen Internet

Re: SA Concepts - plugin for email semantics

Posted by Merijn van den Kroonenberg <me...@web2all.nl>.
>
> With David's help I have tracked down the problem(s). Version 0.02 is
> up. Would be interested to hear your thoughts - even if just theoretical
> - about the effect on the Bayes DB.

Just in theory, I am curious what part of the Bayes filter you hope to
improve? I think you are not adding any *new* information to the e-mail;
your concepts are based purely on the mail content, right?

It seems you just overpower some tokens a bit more but I am not sure if
your concepts are useful for a bayes filter. Especially a bayes filter
would not need this I would say. Maybe the concepts would be useful to
humans or rules written by humans.

>
> Paul
> --
> Paul Stead
> Systems Engineer
> Zen Internet
>



Re: SA Concepts - plugin for email semantics

Posted by Paul Stead <pa...@zeninternet.co.uk>.

On 24/05/16 17:09, David Jones wrote:
> Good idea.  I would like to test this out so I put this on my CentOS 6 servers
> (perl  v5.10.1) and got this:
>
> May 24 10:59:51.850 [30158] warn: plugin: failed to parse plugin /etc/mail/spamassassin/Concepts.pm: Type of arg 1 to push must be array (not private variable) at /etc/mail/spamassassin/Concepts.pm line 84, near "$headl;"
> ...

With David's help I have tracked down the problem(s). Version 0.02 is
up. Would be interested to hear your thoughts - even if just theoretical
- about the effect on the Bayes DB.

Paul
--
Paul Stead
Systems Engineer
Zen Internet

Re: SA Concepts - plugin for email semantics

Posted by David Jones <dj...@ena.com>.
>From: Paul Stead <pa...@zeninternet.co.uk>
>Sent: Tuesday, May 24, 2016 9:55 AM
>To: users@spamassassin.apache.org
>Subject: SA Concepts - plugin for email semantics

>Hi guys,

>Based upon some information from others on the list I have put together
>a plugin for SA which canonicalises an email into its basic "concepts".
>Concepts are converted to tags, which Bayes can use as tokens to further
>help identify spammy/hammy characteristics.

>[...]

>Paul

Good idea.  I would like to test this out so I put this on my CentOS 6 servers
(perl  v5.10.1) and got this:

May 24 10:59:51.850 [30158] warn: plugin: failed to parse plugin /etc/mail/spamassassin/Concepts.pm: Type of arg 1 to push must be array (not private variable) at /etc/mail/spamassassin/Concepts.pm line 84, near "$headl;"
May 24 10:59:51.850 [30158] warn: Type of arg 1 to push must be array (not private variable) at /etc/mail/spamassassin/Concepts.pm line 88, near ");"
May 24 10:59:51.850 [30158] warn: Type of arg 1 to keys must be hash (not hash element) at /etc/mail/spamassassin/Concepts.pm line 93, near "}) "
May 24 10:59:51.850 [30158] warn: Type of arg 1 to keys must be hash (not private variable) at /etc/mail/spamassassin/Concepts.pm line 104, near "$matched_concepts) "
May 24 10:59:51.850 [30158] warn: Type of arg 1 to push must be array (not hash element) at /etc/mail/spamassassin/Concepts.pm line 168, near "$re if"
May 24 10:59:51.850 [30158] warn: Type of arg 1 to keys must be hash (not private variable) at /etc/mail/spamassassin/Concepts.pm line 174, near "$concepts;"
May 24 10:59:51.850 [30158] warn: Compilation failed in require at /usr/share/perl5/vendor_perl/Mail/SpamAssassin/PluginHandler.pm line 109.
May 24 10:59:52.472 [30158] warn: config: failed to parse line, skipping, in "/etc/mail/spamassassin/41_concepts.cf": concepts_dir /etc/mail/spamassassin/concepts
May 24 10:59:52.472 [30158] warn: Unrecognized escape \l passed through in regex; marked by <-- HERE in m/\l <-- HERE otsofmoney\b/ at /usr/share/perl5/vendor_perl/Mail/SpamAssassin/Conf/Parser.pm line 1388.
May 24 10:59:54.646 [30158] warn: lint: 1 issues detected, please rerun with debug enabled for more information

Thanks for sharing your code and time you put into this,
Dave