Posted to users@spamassassin.apache.org by Marc Perkel <su...@junkemailfilter.com> on 2016/01/21 07:21:49 UTC

Can your bayes do this?

OK - Just to show you this isn't Bayesian - see if you can do this.

Here is a list of 5505874 words and phrases used in the subject line of 
HAM and never seen in the subject line of SPAM

http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line of 
SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt

Hope you understand it now. Not Bayesian!!!!

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
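
For concreteness, a minimal sketch of how such disjoint phrase lists can
be derived from two subject corpora. The file names and the
up-to-four-word tokenization are illustrative assumptions (Dianne Skoll
guesses at up-to-four-word phrases later in the thread), not the actual
junkemailfilter.com code:

def phrases(subject, max_words=4):
    # every 1..max_words-word phrase (word n-gram) in one subject line
    words = subject.lower().split()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def phrase_set(path):
    # all phrases seen in a file holding one subject line per line
    with open(path, encoding="utf-8", errors="replace") as f:
        return {p for line in f for p in phrases(line)}

ham = phrase_set("ham-subjects.txt")    # hypothetical input files
spam = phrase_set("spam-subjects.txt")
ham_only = ham - spam                   # cf. subject-ham.txt
spam_only = spam - ham                  # cf. subject-spam.txt

The point of the sketch: both lists are just a set difference over
counted phrases, so the counts existed and were thrown away - which is
what the replies below pick up on.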


Re: Can your bayes do this?

Posted by Dave Warren <da...@hireahit.com>.
On 2016-01-20 22:21, Marc Perkel wrote:
> Here is a list of 3494938 words and phrases used in the subject line 
> of SPAM and never seen in the subject line of HAM
>
> http://www.junkemailfilter.com/data/subject-spam.txt

I thought I'd take you up on this, using a combination of my corpus and 
the other mail I have indexed and trivially searchable (not necessarily 
corpus quality, but something I can review casually). I looked through 
your list of "words and phrases... never seen in the subject line of 
HAM" for entries I thought I might find in my collection of ham, and 
here we go:

"alert you have"
"almost done!"
"almost go"
"application declined"
"application support"
"at any time dave" <-- Found one in my own mailbox! Woot!
"audible app" <-- Audible themselves used this in 2014.
"audio with" <-- Are you kidding? A bunch of hits from my mailbox, I see 
a bunch from OpenBSD's mailing lists, ffmpeg.org, and other places.

My ham indexes are tokenized with punctuation stripped, so I searched for 
"almost done" instead: over a hundred hits, and on manual review I found 
at least two "almost done!" in the first dozen before I got bored. A ton 
of mail is already excluded for various reasons. For results with a small 
number of hits I manually reviewed to rate spamminess; for larger numbers 
of hits I stopped once I found a few strong hits.

I'm looking for substring matches, not necessarily anchored to the start 
or end of the subject, but a good chunk of these comprise the entire 
subject line ("almost done!", "application support", "application 
declined"), so even if you're not looking at substrings, it's still a 
sloppy mess.

This is only on a few million messages that comprise a very narrow slice 
of the mail flow on the internet, and only from those customers where I 
can query their mail trivially.
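
A rough sketch of that kind of spot check, assuming a plain file of
known-ham subjects plus the downloaded subject-spam.txt; the
normalization (lowercase, punctuation stripped) mirrors the tokenized
index described above, everything else is hypothetical:

import re

def normalize(s):
    # lowercase and strip punctuation, like the tokenized ham index
    return re.sub(r"[^\w\s]", " ", s.lower())

with open("subject-spam.txt", encoding="utf-8", errors="replace") as f:
    spam_phrases = [normalize(line).strip() for line in f if line.strip()]

# linear scan: fine for a casual review, too slow for millions of phrases
with open("ham-subjects.txt", encoding="utf-8", errors="replace") as f:
    for subject in f:
        subj = normalize(subject)
        for phrase in spam_phrases:
            if phrase and phrase in subj:   # unanchored substring match
                print(repr(phrase), "found in ham:", subject.strip())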


> Hope you understand it now. Not Bayesian!!!! 

Perhaps not, but it seems like it's a natural precursor to a bayesian 
implementation. As RW said further down the thread:

> the only difference between
>
>    "ambulatory care" -> only in ham
>    "aall cards"      -> only in spam
>
> and
>
>     "ambulatory care"  occurs 16 times in ham and 0 times in spam
>     "aall cards"       occurs  0 times in ham and 3 times in spam
>
> is that you have discarded the count information.

And count information is important in determining the likely 
trustworthiness of a result. What would your system do with a phrase 
that appears in thousands of ham messages, and 2 spam messages? Ignore 
it completely?

-- 
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren
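
For illustration, here is what a count-aware scorer can do with that
"thousands of ham, 2 spam" phrase instead of discarding it: a
Graham-style per-token spam probability (a sketch, not SA's actual
BAYES_* code), clamped so no token is ever treated as perfectly "pure".
All counts are made up:

def token_spamprob(spam_hits, ham_hits, nspam, nham, floor=0.01, ceil=0.99):
    if spam_hits + ham_hits == 0:
        return 0.5                     # unseen token carries no information
    s = spam_hits / max(nspam, 1)      # frequency within the spam corpus
    h = ham_hits / max(nham, 1)        # frequency within the ham corpus
    return min(ceil, max(floor, s / (s + h)))

# "thousands of ham messages, and 2 spam messages":
print(token_spamprob(2, 5000, nspam=100000, nham=100000))  # 0.01: strong ham
# RW's examples, with the count information kept:
print(token_spamprob(0, 16, nspam=100000, nham=100000))    # 0.01 (clamped)
print(token_spamprob(3, 0, nspam=100000, nham=100000))     # 0.99 (clamped)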



Re: Can your bayes do this?

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Wed, 20 Jan 2016 22:21:49 -0800
Marc Perkel <su...@junkemailfilter.com> wrote:

> Here is a list of 5505874 words and phrases used in the subject line
> of HAM and never seen in the subject line of SPAM

> Here is a list of 3494938 words and phrases used in the subject line
> of SPAM and never seen in the subject line of HAM

[snip]

And what, exactly, is your point?  Bayes would handle that just fine.
Tokens in your first list would score 0.00 for spam probability and
tokens in your second list would score 1.00 and Bayes would be great.

Regards,

Dianne.

Re: Can your bayes do this?

Posted by Matthias Apitz <gu...@unixarea.de>.
On Wednesday, January 20, 2016 at 10:21:49PM -0800, Marc Perkel wrote:

> OK - Just to show you this isn't Bayesian - see if you can do this.
> 
> Here is a list of 5505874 words and phrases used in the subject line of 
> HAM and never seen in the subject line of SPAM
> 
> http://www.junkemailfilter.com/data/subject-ham.txt
> 
> Here is a list of 3494938 words and phrases used in the subject line of 
> SPAM and never seen in the subject line of HAM
> 
> http://www.junkemailfilter.com/data/subject-spam.txt
> 
> Hope you understand it now. Not Bayesian!!!!
> 
> -- 
> Marc Perkel - Sales/Support
> support@junkemailfilter.com
> http://www.junkemailfilter.com
> Junk Email Filter dot com
> 415-992-3400

Somehow this whole thread smells like an advertisement for some company, 
or is it only me who feels this?

	matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, ⌂ http://www.unixarea.de/  ☎ +49-176-38902045
UNIX since V7 on PDP-11 | UNIX on mainframe since ESER 1055 (IBM /370)
UNIX on x86 since SVR4.2 UnixWare 2.1.2 | FreeBSD since 2.2.5

Re: Can your bayes do this?

Posted by Antony Stone <An...@spamassassin.open.source.it>.
On Thursday 21 January 2016 at 13:11:15, RW wrote:

> On Wed, 20 Jan 2016 22:21:49 -0800 Marc Perkel wrote:
> > OK - Just to show you this isn't Bayesian - see if you can do this.
> > 
> > Here is a list of 5505874 words and phrases used in the subject line
> > of HAM and never seen in the subject line of SPAM
> > 
> > http://www.junkemailfilter.com/data/subject-ham.txt
> > 
> > Here is a list of 3494938 words and phrases used in the subject line
> > of SPAM and never seen in the subject line of HAM
> > 
> > http://www.junkemailfilter.com/data/subject-spam.txt
> > 
> > Hope you understand it now. Not Bayesian!!!!
> 
> the only difference between
> 
> 
>   "ambulatory care" -> only in ham
>   "aall cards"      -> only in spam
> 
> and
> 
>    "ambulatory care"  occurs 16 times in ham and 0 times in spam
> 
>    "aall cards"       occurs  0 times in ham and 3 times in spam
> 
> is that you have discarded the count information.

Plus, the "never in ham" and "never in spam" lists omit any mention of words & 
phrases which exist in differing proportions in both - Bayes includes that, and 
I would expect that a spam identifier which takes account of as many known 
charactersistics of spam/ham as possible is going to do the best job.


Antony.

-- 
Software development can be quick, high quality, or low cost.

The customer gets to pick any two out of three.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Can your bayes do this?

Posted by Reindl Harald <h....@thelounge.net>.
On 21.01.2016 at 14:17, RW wrote:
> On Thu, 21 Jan 2016 13:45:08 +0100
> Christian Laußat wrote:
>
>> On 21.01.2016 at 13:19, Reindl Harald wrote:
>>> not entirely, when "Currently, SA's bayes tokens are single words" from
>>> https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3C509D55A8.30601@gmail.com%3E
>>> is still true
>>>
>>> please review that response below and consider 2/4-word tokens
>>> *additionally* in the SA tokenizer, and it will beat out the "new
>>> magic" easily with a well trained bayes in all cases
>>
>> Bogofilter has an option to specify how many tokens to put into
>> bayes. Here is an analysis of how effective this was:
>> http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html
>>
>> In my opinion it's not worth the effort. You'll blow up your database
>> for only a slightly better matching rate.
>
> The FNs dropped from 287 to 69, which I'd call a four-fold improvement.
>
> The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
> arguably it just did a better job in detecting the embedded spam.

also see http://www.paulgraham.com/sofar.html

When the spammers do try to rewrite their messages, they'll probably do 
it by replacing individual spammy tokens with phrases of more neutral 
words. But multi-word filters will learn and catch these phrases too
_____________________________________

and arguably the "blown up" database can have the effect that you need 
fewer training samples for the same outcome
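
A minimal sketch of the multi-word tokenization being discussed: emit the
usual single-word tokens plus 2..N-word tokens additionally. This
illustrates the proposal, not SA's actual tokenizer; note that each extra
n-gram order adds roughly one token per word, which is where the database
growth in the Bogofilter numbers comes from:

def tokens(text, max_words=2):
    words = text.lower().split()
    out = list(words)                        # 1-word tokens, as today
    for n in range(2, max_words + 1):        # additional n-word tokens
        out += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return out

print(tokens("replica watches cheap viagra"))
# ['replica', 'watches', 'cheap', 'viagra',
#  'replica watches', 'watches cheap', 'cheap viagra']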


Re: Can your bayes do this?

Posted by Reindl Harald <h....@thelounge.net>.
On 21.01.2016 at 17:53, John Hardin wrote:
> On Thu, 21 Jan 2016, RW wrote:
>
>> On Thu, 21 Jan 2016 14:31:09 +0100
>> Christian Laußat wrote:
>>
>>> On 21.01.2016 at 14:17, RW wrote:
>>>> The FNs dropped from 287 to 69, which I'd call a four-fold
>>>> improvement.
>>>>
>>>> The FPs rose from 0 to 1, but that mail was ham quoting a full
>>>> spam, so arguably it just did a better job in detecting the
>>>> embedded spam.
>>>
>>> Yes, but is it really worth the resources? I mean, the database got
>>> 13 times larger for 3-word tokens, and with more words per token it
>>> will grow exponentially.
>>
>> But if you are training on error it only grows by a factor of 3.1
>> (13*69/287).  You also have to consider what happens if you simply
>> reduce the retention time by a factor of 3.1 - that corpus had 4 years
>> retention so it's unlikely that maintaining a constant size database
>> would have made much difference in this case. When you train from
>> corpus the database size is dominated by ephemeral tokens which makes
>> the situation look worse than it is.
>>
>> It depends what you want. I don't care about an extra 100 MB
>> of disk space and a few milliseconds if it gives any measurable
>> improvement.
>>
>> Personally I wouldn't like to see Bayes go multi-word because it would
>> likely end up as a poor compromise. Two-word tokenization is the
>> default on DSPAM, but I've not seen anyone advocate using it. I think
>> it's better to score in an external filter that runs in addition to
>> Bayes.
>
> There was an improvement in FP and FN from two tokens. The marginal
> improvement from three doesn't seem worth it.
>
> I'd like to see a SA Bayes config option to select between one-word and
> two-word tokens


not only you!

like "bayes_token_sources all" was introduced a "bayes_multiword_tokens 
<integer>" would be perfect dsiabled by default, so one could easily 
verify the differences with a existing corpus and what's the best result

like the mime-tokens these should be additional ones to the in any case 
generated 1-word-tokens
_________________________

for "Two-word tokenization is the default on DSPAM, but I've not seen 
anyone advocate using it" - just because it is a dead project, looking 
only at the bayes-implementation i have read more than once it's better 
then SA and the reason to not consider it was the fact it's dead and 
full of unfixed bugs


Re: Can your bayes do this?

Posted by John Hardin <jh...@impsec.org>.
On Thu, 21 Jan 2016, RW wrote:

> On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
> John Hardin wrote:
>
>> There was an improvement in FP and FN from two tokens. The marginal
>> improvement from three doesn't seem worth it.
>
> The improvement from 2 to 3 is more substantial than from 1 to 2
>
> 287/160 = 1.79
>
> 160/69  = 2.3

Ugh. I looked at the raw numbers rather than the ratio - sorry.

287/69 looks even better, 4.2

> Whether any of this is worth it depends on a lot of things. I don't
> think it's even obvious whether 3-word tokenization is more resource
> intensive than 2-word. Clearly in the limit where ntokens goes to
> infinity  3-word will outperform 2-word at the same database size,
> which means that it can achieve the same level of performance with a
> smaller database. I've no feeling for what value of ntokens that
> switches around.

So it should be configurable; if you change it, monitor the token 
database size, scan times, and FP/FN rates, and adjust token expiry 
to manage them, or switch back to 1 if the improvement costs too much.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Maxim IV: Close air support covereth a multitude of sins.
-----------------------------------------------------------------------
  2 days until John Moses Browning's 161st Birthday

Re: Can your bayes do this?

Posted by Reindl Harald <h....@thelounge.net>.

On 21.01.2016 at 20:38, RW wrote:
> On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
> John Hardin wrote:
>
>
>> There was an improvement in FP and FN from two tokens. The marginal
>> improvement from three doesn't seem worth it.
>
> The improvement from 2 to 3 is more substantial than from 1 to 2
>
>   287/160 = 1.79
>
>   160/69  = 2.3
>
> Whether any of this is worth it depends on a lot of things. I don't
> think it's even obvious whether 3-word tokenization is more resource
> intensive than 2-word. Clearly in the limit where ntokens goes to
> infinity  3-word will outperform 2-word at the same database size,
> which means that it can achieve the same level of performance with a
> smaller database. I've no feeling for what value of ntokens that
> switches around

If SA provided a param to add additional tokens, like 
"bayes_multiword_tokens <integer>", I could test it against 80000 
messages with different <integer> values. There is also a 700-entry 
ignore-list for our daily check which could be tested automatically, to 
see whether those samples swap over to BAYES_999 like the rest while all 
ham samples still get BAYES_00.

I run those tests every night against the whole corpus, with a report to 
detect mis-training when samples previously classified as BAYES_999 or 
BAYES_00 change their result.

That's done with a dedicated SA instance doing only the bayes test and 
nothing else, fed by "spamc" and parsing the outputs; it takes around 1 
hour on the current hardware.
________________________

The exclude list can be checked in isolation with a param, and anything 
which reaches BAYES_999 is automatically removed; the output looks like 
the sample below (no, the worker scripts are not running as root).

So the first test would fire that with 2-, 3- and 4-word tokens and look 
at how many samples change to BAYES_999 while no ham samples from the 
large tests lose their BAYES_00.

I can clone that machine and re-build the whole bayes database from 
scratch within 15 minutes from the corpus files.
________________________

[root@mail-gw:~]$ corpus-stats ignored
NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml
1 / 639 (SPAM: 2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml
2 / 639 (SPAM: 2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml
3 / 639 (SPAM: 2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml
4 / 639 (SPAM: 2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml
5 / 639 (SPAM: 2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml
6 / 639 (SPAM: 2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml)

NON-BAYES-999: 
/var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml
7 / 639 (SPAM: 2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml)
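
A sketch of the nightly re-test loop described above, assuming the rule
names (BAYES_999 etc.) appear in the spamc -R report output; the training
path matches the listing above, while the script itself is hypothetical
(not the actual corpus-stats helper):

import glob, subprocess

def bayes_bucket(path):
    # feed one stored message to the dedicated SA instance via spamc
    with open(path, "rb") as f:
        out = subprocess.run(["spamc", "-R"], stdin=f,
                             capture_output=True, text=True).stdout
    for bucket in ("BAYES_999", "BAYES_99", "BAYES_00"):
        if bucket in out:              # check BAYES_999 before BAYES_99
            return bucket
    return "OTHER"

spam = sorted(glob.glob("/var/lib/spamass-milter/training/spam/*.eml"))
for i, path in enumerate(spam, 1):
    if bayes_bucket(path) != "BAYES_999":  # spam sample lost its BAYES_999
        print("NON-BAYES-999:", path)
        print("%d / %d (SPAM: %s)" % (i, len(spam), path.rsplit("/", 1)[-1]))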



Re: Can your bayes do this?

Posted by RW <rw...@googlemail.com>.
On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:


> There was an improvement in FP and FN from two tokens. The marginal 
> improvement from three doesn't seem worth it.

The improvement from 2 to 3 is more substantial than from 1 to 2

 287/160 = 1.79

 160/69  = 2.3

Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly in the limit where ntokens goes to
infinity  3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for what value of ntokens that
switches around.



Re: Can your bayes do this?

Posted by John Hardin <jh...@impsec.org>.
On Thu, 21 Jan 2016, RW wrote:

> On Thu, 21 Jan 2016 14:31:09 +0100
> Christian Laußat wrote:
>
>> On 21.01.2016 at 14:17, RW wrote:
>>> The FNs dropped from 287 to 69, which I'd call a four-fold
>>> improvement.
>>>
>>> The FPs rose from 0 to 1, but that mail was ham quoting a full
>>> spam, so arguably it just did a better job in detecting the
>>> embedded spam.
>>
>> Yes, but is it really worth the resources? I mean, the database got
>> 13 times larger for 3-word tokens, and with more words per token it
>> will grow exponentially.
>
> But if you are training on error it only grows by a factor of 3.1
> (13*69/287).  You also have to consider what happens if you simply
> reduce the retention time by a factor of 3.1 - that corpus had 4 years
> retention so it's unlikely that maintaining a constant size database
> would have made much difference in this case. When you train from
> corpus the database size is dominated by ephemeral tokens which makes
> the situation look worse than it is.
>
> It depends what you want. I don't care about an extra 100 MB
> of disk space and a few milliseconds if it gives any measurable
> improvement.
>
> Personally I wouldn't like to see Bayes go multi-word because it would
> likely end up as a poor compromise. Two-word tokenization is the
> default on DSPAM, but I've not seen anyone advocate using it. I think
> it's better to score in an external filter that runs in addition to
> Bayes.

There was an improvement in FP and FN from two tokens. The marginal 
improvement from three doesn't seem worth it.

I'd like to see a SA Bayes config option to select between one-word and 
two-word tokens.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Public Education: the bureaucratic process of replacing
   an empty mind with a closed one.                          -- Thorax
-----------------------------------------------------------------------
  2 days until John Moses Browning's 161st Birthday

Re: Can your bayes do this?

Posted by RW <rw...@googlemail.com>.
On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:

> On 21.01.2016 at 14:17, RW wrote:
> > The FNs dropped from 287 to 69, which I'd call a four-fold
> > improvement.
> > 
> > The FPs rose from 0 to 1, but that mail was ham quoting a full
> > spam, so arguably it just did a better job in detecting the
> > embedded spam.  
> 
> Yes, but is it really worth the resources? I mean, the database got
> 13 times larger for 3-word tokens, and with more words per token it
> will grow exponentially.

But if you are training on error it only grows by a factor of 3.1
(13*69/287).  You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1 - that corpus had 4 years
retention so it's unlikely that maintaining a constant size database
would have made much difference in this case. When you train from
corpus the database size is dominated by ephemeral tokens which makes
the situation look worse than it is. 

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement. 

Personally I wouldn't like to see Bayes go multi-word because it would
likely end up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.
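
Spelling out the growth estimate above, under the implicit assumption
that database size scales with (tokens per message) times (messages
trained):

per_message = 13      # Bogofilter test: 3-word tokens -> 13x larger DB
trained = 69 / 287    # train-on-error: only the remaining FNs get trained
print(per_message * trained)   # ~3.1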



  

Re: Can your bayes do this?

Posted by RW <rw...@googlemail.com>.
On Thu, 21 Jan 2016 13:45:08 +0100
Christian Laußat wrote:

> On 21.01.2016 at 13:19, Reindl Harald wrote:
> > not entirely, when "Currently, SA's bayes tokens are single words" from
> > https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3C509D55A8.30601@gmail.com%3E
> > is still true
> > 
> > please review that response below and consider 2/4-word tokens
> > *additionally* in the SA tokenizer, and it will beat out the "new
> > magic" easily with a well trained bayes in all cases
> 
> Bogofilter has an option to specify how many tokens to put into
> bayes. Here is an analysis of how effective this was:
> http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html
> 
> In my opinion it's not worth the effort. You'll blow up your database 
> for only a slightly better matching rate.

The FNs dropped from 287 to 69, which I'd call a four-fold improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
arguably it just did a better job in detecting the embedded spam.

Re: Can your bayes do this?

Posted by Christian Laußat <sp...@list.laussat.de>.
On 21.01.2016 at 13:19, Reindl Harald wrote:
> not entirely, when "Currently, SA's bayes tokens are single words" from
> https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3C509D55A8.30601@gmail.com%3E
> is still true
> 
> please review that response below and consider 2/4-word tokens
> *additionally* in the SA tokenizer, and it will beat out the "new
> magic" easily with a well trained bayes in all cases

Bogofilter has an option to specify how many tokens to put into bayes. 
Here is an analysis of how effective this was:
http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html

In my opinion it's not worth the effort. You'll blow up your database 
for only a slightly better matching rate.

-- 
Christian Laußat
https://blog.laussat.de

Re: Can your bayes do this?

Posted by RW <rw...@googlemail.com>.
On Thu, 21 Jan 2016 13:19:20 +0100
Reindl Harald wrote:

> On 21.01.2016 at 13:11, RW wrote:
> > On Wed, 20 Jan 2016 22:21:49 -0800
> > Marc Perkel wrote:
> >  
> >> OK - Just to show you this isn't Bayesian - see if you can do this.
> >>
> >> Here is a list of 5505874 words and phrases used in the subject
> >> line of HAM and never seen in the subject line of SPAM
> >>
> >> http://www.junkemailfilter.com/data/subject-ham.txt
> >>
> >> Here is a list of 3494938 words and phrases used in the subject
> >> line of SPAM and never seen in the subject line of HAM
> >>
> >> http://www.junkemailfilter.com/data/subject-spam.txt
> >>
> >> Hope you understand it now. Not Bayesian!!!!  
> >
> >
> > the only difference between
> >
> >
> >    "ambulatory care" -> only in ham
> >    "aall cards"      -> only in spam
> >
> > and
> >
> >     "ambulatory care"  occurs 16 times in ham and 0 times in spam
> >     "aall cards"       occurs  0 times in ham and 3 times in spam
> >
> > is that you have discarded the count information  
> 
> not entirely, when "Currently, SA's bayes tokens are single words" from 


Yes, obviously. The assertion was that it's doing something that a
Bayesian filter can't - not specifically Bayes.

Re: Can your bayes do this?

Posted by Reindl Harald <h....@thelounge.net>.

On 21.01.2016 at 13:11, RW wrote:
> On Wed, 20 Jan 2016 22:21:49 -0800
> Marc Perkel wrote:
>
>> OK - Just to show you this isn't Bayesian - see if you can do this.
>>
>> Here is a list of 5505874 words and phrases used in the subject line
>> of HAM and never seen in the subject line of SPAM
>>
>> http://www.junkemailfilter.com/data/subject-ham.txt
>>
>> Here is a list of 3494938 words and phrases used in the subject line
>> of SPAM and never seen in the subject line of HAM
>>
>> http://www.junkemailfilter.com/data/subject-spam.txt
>>
>> Hope you understand it now. Not Bayesian!!!!
>
>
> the only difference between
>
>
>    "ambulatory care" -> only in ham
>    "aall cards"      -> only in spam
>
> and
>
>     "ambulatory care"  occurs 16 times in ham and 0 times in spam
>     "aall cards"       occurs  0 times in ham and 3 times in spam
>
> is that you have discarded the count information

not entirely, when "Currently, SA's bayes tokens are single words" from 
https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3C509D55A8.30601@gmail.com%3E 
is still true

please review that response below and consider 2/4-word tokens 
*additionally* in the SA tokenizer, and it will beat out the "new magic" 
easily with a well trained bayes in all cases

-------- Forwarded Message --------
Subject: Re: My new method for blocking spam - REVEALED!
Date: Wed, 20 Jan 2016 15:20:01 -0500
From: Dianne Skoll <df...@roaringpenguin.com>
Organization: Roaring Penguin Software Inc.
To: users@spamassassin.apache.org

On Wed, 20 Jan 2016 12:11:02 -0800
Marc Perkel <su...@junkemailfilter.com> wrote:

 > Again - it's not about matching as Bayes does. It's about not
 > matching.

It's not about not matching. It's about a preprocessing step that
discards tokens that don't have extreme probabilities.

I think your method works as well as it does because you're using up
to four-word phrases as tokens. The rest of the method is nonsense, but
the four-word phrase tokens are the magic ingredient; they'd make Bayes 
work awesomely also.
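
That preprocessing step is essentially Paul Graham's "most interesting
tokens" selection from "A Plan for Spam" (cited elsewhere in this
thread). A minimal sketch, assuming per-token probabilities from a
count-based table:

def combine(token_probs, keep=15):
    # most interesting first: greatest distance from the neutral 0.5
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    prod, inv = 1.0, 1.0
    for p in interesting[:keep]:    # discard the near-neutral tokens
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)      # naive-Bayes combination

print(combine([0.99, 0.99, 0.01, 0.50, 0.51, 0.49], keep=3))
# ~0.99: the near-neutral tokens are ignored entirely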


Re: Can your bayes do this?

Posted by Dianne Skoll <df...@roaringpenguin.com>.
On Thu, 21 Jan 2016 12:11:15 +0000
RW <rw...@googlemail.com> wrote:

>   "ambulatory care" -> only in ham
...
> is that you have discarded the count information.

And his assertion is not necessarily true, either.  According to our
statistics, we've seen "ambulatory care" in 1400 hams, but also in 22
spams.  While 1400/1422 still makes the token useful for Bayes, his algorithm
would discount it altogether because it's not "pure" ham.

Regards,

Dianne.

Re: Can your bayes do this?

Posted by RW <rw...@googlemail.com>.
On Wed, 20 Jan 2016 22:21:49 -0800
Marc Perkel wrote:

> OK - Just to show you this isn't Bayesian - see if you can do this.
> 
> Here is a list of 5505874 words and phrases used in the subject line
> of HAM and never seen in the subject line of SPAM
> 
> http://www.junkemailfilter.com/data/subject-ham.txt
> 
> Here is a list of 3494938 words and phrases used in the subject line
> of SPAM and never seen in the subject line of HAM
> 
> http://www.junkemailfilter.com/data/subject-spam.txt
> 
> Hope you understand it now. Not Bayesian!!!!


the only difference between


  "ambulatory care" -> only in ham
  "aall cards"      -> only in spam

and 
   

   "ambulatory care"  occurs 16 times in ham and 0 times in spam
   
   "aall cards"       occurs  0 times in ham and 3 times in spam

is that you have discarded the count information.


Re: Can your bayes do this?

Posted by Reindl Harald <h....@thelounge.net>.

On 21.01.2016 at 07:21, Marc Perkel wrote:
> OK - Just to show you this isn't Bayesian - see if you can do this.
>
> Here is a list of 5505874 words and phrases used in the subject line of
> HAM and never seen in the subject line of SPAM
>
> http://www.junkemailfilter.com/data/subject-ham.txt
>
> Here is a list of 3494938 words and phrases used in the subject line of
> SPAM and never seen in the subject line of HAM
>
> http://www.junkemailfilter.com/data/subject-spam.txt
>
> Hope you understand it now. Not Bayesian!!!!

Don't get me wrong, but I don't take anybody seriously who needs "!!!!", 
and if you don't stop advertising that aggressively you'll be classified 
as a spammer too.

177 MB of subjects alone?

Well, not really impressive, given that I easily get the same results 
with an 81 MB bayes-db containing the *complete* junk of 1.5 years and 
only selected ham (mail reported as wrongly classified, my personal mail 
and a few inboxes from nice users).

When I can get the same results with a 600 MB corpus containing around 
81000 messages, the only thing I understand now is that your method is 
not really efficient and needs access to all mails for training, which 
is a no-go.

[harry@srv-rhsoft:~]$ curl --head 
http://www.junkemailfilter.com/data/subject-spam.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:15 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:11:41 GMT
ETag: "340315d-446e47c-529d1f9f0676b"
Accept-Ranges: bytes
Content-Length: 71754876
Connection: close
Content-Type: text/plain

[harry@srv-rhsoft:~]$ curl --head 
http://www.junkemailfilter.com/data/subject-ham.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:25 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:09:18 GMT
ETag: "340309c-645b7a1-529d1f16ad5db"
Accept-Ranges: bytes
Content-Length: 105232289
Connection: close
Content-Type: text/plain