
Posted to users@spamassassin.apache.org by Marc Perkel <su...@junkemailfilter.com> on 2016/01/20 20:46:39 UTC

My new method for blocking spam - example

Let me give you an example. Here's 2 subject lines. Easy to guess which 
one is spam.

"Meet horny Russian Brides online!"
"I read an article about Russian brides in a magazine."

Bayes or SpamAssassin would look at "Russian Brides" and see that 499 out
of 500 times it's spam. Therefore the nonspam version scores spam points.

In my system "Russian brides" is neutral because it is used in both spam 
and ham. But on the spam side, these phrases are used in other spam and
are *not matched* in ham:

Meet horny
horny Russian
horny Russian brides
brides online!
online!

On the ham side, these phrases are used in ham and are *not matched* in spam:

I read an article
read an article
an article about
brides in a magazine
in a magazine

My filter gets both right because of NOT matching. Not matching is a 
comparison to an infinite set.
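
The idea described above can be sketched in Python. This is only an illustration under stated assumptions, not the poster's actual filter: the function names (`phrases`, `train`, `classify`), the four-word limit, the whitespace tokenizing, and the simple count comparison are all hypothetical choices, and the tiny training sets are made up.

```python
def phrases(text, max_words=4):
    """Return every run of 1..max_words consecutive words, lowercased."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def train(messages):
    """Collect every phrase seen in a corpus (hypothetical helper)."""
    seen = set()
    for message in messages:
        seen |= phrases(message)
    return seen


def classify(text, spam_seen, ham_seen):
    """Count phrases known only from spam against phrases known only from ham."""
    found = phrases(text)
    spam_only = found & (spam_seen - ham_seen)
    ham_only = found & (ham_seen - spam_seen)
    if len(spam_only) > len(ham_only):
        return "spam"
    if len(ham_only) > len(spam_only):
        return "ham"
    return "unsure"


# Made-up training data: "russian brides" occurs on both sides, so it is neutral.
spam_seen = train(["Meet horny singles online!", "horny Russian brides waiting"])
ham_seen = train(["I read an article about cooking",
                  "Russian brides in a magazine story"])

print(classify("Meet horny Russian Brides online!", spam_seen, ham_seen))
print(classify("I read an article about Russian brides in a magazine.",
               spam_seen, ham_seen))
```

In this sketch a phrase never seen in training lands in neither set, so a message made up entirely of new phrases comes out as "unsure".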

Re: My new method for blocking spam - example

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Wed, 20 Jan 2016 11:46:39 -0800
Marc Perkel <su...@junkemailfilter.com> wrote:

> Let me give you an example. Here's 2 subject lines. Easy to guess
> which one is spam.

But those are easy for Bayes also.  Your filter (and Bayes) will
have trouble with the short micro-spams with fairly neutral words in them.

Regards,

Dianne.

Re: My new method for blocking spam - example

Posted by Reindl Harald <h....@thelounge.net>.


On 20.01.2016 at 20:52, Marc Perkel wrote:
>
> On 01/20/16 11:50, Reindl Harald wrote:
>>
>>
>> On 20.01.2016 at 20:46, Marc Perkel wrote:
>>> Let me give you an example. Here's 2 subject lines. Easy to guess which
>>> one is spam.
>>>
>>> "Meet horny Russian Brides online!"
>>> "I read an article about Russian brides in a magazine."
>>>
>>> Bayes or SpamAssassin would look at "Russian Brides" and see that 499 out
>>> of 500 times it's spam. Therefore the nonspam version scores spam points.
>>>
>>> In my system "Russian brides" is neutral because it is used in both spam
>>> and ham. But on the spam side, phrases used in other spam *not matched*
>>> in ham
>>
>> that is *exactly* how bayes works and subject alone is *not* the key
>>
>> tokenizing the *whole* message with enough spam *and* ham samples is
>> the key - so there are two options:
>>
>> * you reinvented bayes with a different name
>> * you modified bayes with some tricks and hope
>>   spammers would not adopt them
>>
>> anyways, i doubt there is a sane reason for a patent because the
>> principles are just prior art -> bayes
>>
>>
>
> Again - Bayes compares what matches. My filter compares what doesn't match

and how is "what doesn't match" classified?

as spam?
so every mail it has not seen before is spammy?

as ham?
completely random junk is hammy?

Re: My new method for blocking spam - example

Posted by Tom Hendrikx <to...@whyscream.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 20-01-16 21:01, Dianne Skoll wrote:
> On Wed, 20 Jan 2016 11:52:35 -0800 Marc Perkel
> <su...@junkemailfilter.com> wrote:
> 
>> Again - Bayes compares what matches. My filter compares what
>> doesn't match.
> 
> Your filter is exactly equivalent to Bayes if you do the following 
> things:
> 
> 1) Use combinations of up to four words as tokens, instead of just 
> single tokens.
> 
> 2) Throw out any tokens whose probability is not either 100% spam
> or 100% ham.
> 
> Idea (1) is probably good.  We use words and word-pairs.  I'm not
> sure the extra storage for more than pairs is justifiable.
> 

Dspam implements up to 5 words, including wildcards for intermediate
words. See OSB and SBPH methods explained at
http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:dspam#content_tokenizing

In general, OSB gave the best results IIRC.

Regards,
	Tom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJWn/kZAAoJEJPfMZ19VO/1JnMP/j7DCm1frjbJZy/uhyEZzskQ
+o010elBUdzubXgW0QIaHdxvkbbPgyu6yyEZHe/lj04h/vMgRmFHjbAWo+ZF1DZX
rjt2Aa85fLtoAxKojZjh/dfL0XaPfkTL3kAm8KQupe/2QRRAdVAfiC4iXKovvMSw
hBtj1Hjrsfj6j4h4k+JjtrSY+R4K8ugCTK004x+uNxUzrWUVt/WWhVet4DkEJrHR
P9JTy7tdh5qQjm0cY93689a5pKf7WBDQ1skAsKek0LWb3BhwUEuWclBR+PLhjtqR
0DJzdKh35I4Gl1k2zYb56xE1DkjiBoyQzR6FwvIeq8+9WZzMlDbxgTaz3vwOMsJ3
AcWyO1mG/FaNzzGLrcY0A586aeb9Gn+x+otFkx85vR3EBGvZMIrKbMq8V16uSldt
P87wM3aHsiR0OnqebytbFCmWQ4E+QG7jH+eIEqL2utWsyKKmrzXR/d4dd0w0Gkhl
65s5nm3/7UK6UI73fvnuY5/PFy5L7pFIcZGGeHiMeQlxyrduJvkmC8uCPEarMJ2D
X0xhOahUnglfxAPnjG31umB/9aJ1Z68f6pOEqw8IwmgeH5TxSZNEcuLexAFvSFYH
naB3A7Csv7Km46xdJ92XnH3yuD03zrxIOfBNymaUea6IYSZpCB3tQw95B7t1+7U+
IzOeONJNye+fdmHu5GBk
=M3sm
-----END PGP SIGNATURE-----
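
The OSB (orthogonal sparse bigrams) tokenizing mentioned in the message above can be sketched as follows. This is a simplified illustration of the technique as it is commonly described, not DSPAM's actual code: the function name `osb_tokens`, the `<skip>` placeholder, and the whitespace splitting are assumptions made for the example.

```python
def osb_tokens(text, window=5):
    """Pair each word with each of the up to (window - 1) words before it.

    Skipped intermediate words are replaced by a "<skip>" placeholder so
    that the distance between the two words stays part of the token.
    """
    words = text.split()
    tokens = []
    for end in range(1, len(words)):
        for gap in range(window - 1):
            start = end - gap - 1
            if start < 0:
                break
            tokens.append(" ".join([words[start]] + ["<skip>"] * gap + [words[end]]))
    return tokens


for token in osb_tokens("Meet horny Russian brides online"):
    print(token)
```

With a window of five words, each word contributes at most four tokens, so the token count grows linearly with message length rather than with every possible word combination.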

Re: My new method for blocking spam - example

Posted by Reindl Harald <h....@thelounge.net>.


On 20.01.2016 at 21:24, John Hardin wrote:
> On Wed, 20 Jan 2016, Dianne Skoll wrote:
>
>> On Wed, 20 Jan 2016 11:52:35 -0800
>> Marc Perkel <su...@junkemailfilter.com> wrote:
>>
>>> Again - Bayes compares what matches. My filter compares what doesn't
>>> match.
>>
>> Your filter is exactly equivalent to Bayes if you do the following
>> things:
>>
>> 1) Use combinations of up to four words as tokens, instead of just
>> single tokens.
>>
>> 2) Throw out any tokens whose probability is not either 100% spam or
>> 100% ham.
>>
>> Idea (1) is probably good.  We use words and word-pairs.  I'm not sure
>> the extra storage for more than pairs is justifiable.
>
> Personally I'd rather see SA implement *that*

yes, the part below as *additional tokens* to what bayes does now

-------- Forwarded Message --------
Subject: Re: My new method for blocking spam - REVEALED!
Date: Wed, 20 Jan 2016 15:20:01 -0500
From: Dianne Skoll <df...@roaringpenguin.com>
Organization: Roaring Penguin Software Inc.
To: users@spamassassin.apache.org

On Wed, 20 Jan 2016 12:11:02 -0800
Marc Perkel <su...@junkemailfilter.com> wrote:

 > Again - it's not about matching as Bayes does. It's about not
 > matching.

It's not about not matching.  It's about a preprocessing step that
discards tokens that don't have extreme probabilities.

I think your method works as well as it does because you're using up
to four-word phrases as tokens.  The rest of the method is nonsense, but
the four-word phrase tokens are the magic ingredient; they'd make Bayes 
work awesomely also.

Re: My new method for blocking spam - example

Posted by John Hardin <jh...@impsec.org>.

On Wed, 20 Jan 2016, Dianne Skoll wrote:

> On Wed, 20 Jan 2016 11:52:35 -0800
> Marc Perkel <su...@junkemailfilter.com> wrote:
>
>> Again - Bayes compares what matches. My filter compares what doesn't
>> match.
>
> Your filter is exactly equivalent to Bayes if you do the following
> things:
>
> 1) Use combinations of up to four words as tokens, instead of just
> single tokens.
>
> 2) Throw out any tokens whose probability is not either 100% spam or 100% ham.
>
> Idea (1) is probably good.  We use words and word-pairs.  I'm not sure the
> extra storage for more than pairs is justifiable.

Personally I'd rather see SA implement *that*.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  3 days until John Moses Browning's 161st Birthday

Re: My new method for blocking spam - example

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Wed, 20 Jan 2016 11:52:35 -0800
Marc Perkel <su...@junkemailfilter.com> wrote:

> Again - Bayes compares what matches. My filter compares what doesn't
> match.

Your filter is exactly equivalent to Bayes if you do the following
things:

1) Use combinations of up to four words as tokens, instead of just
single tokens.

2) Throw out any tokens whose probability is not either 100% spam or 100% ham.

Idea (1) is probably good.  We use words and word-pairs.  I'm not sure the
extra storage for more than pairs is justifiable.

Idea (2) is probably bad.  You are throwing out potentially useful
information.

Regards,

Dianne.
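
The two steps listed above can be sketched in Python to show what the pruning in step (2) does. This is a rough illustration, not SpamAssassin's Bayes implementation: the helper names (`tokens`, `token_probs`, `extreme_only`), the naive per-token estimate, and the sample messages are all assumptions made for the example, and both corpora are assumed to be non-empty.

```python
from collections import Counter


def tokens(text, max_words=4):
    """Step 1: every run of 1..max_words consecutive words is a token."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def token_probs(spam_msgs, ham_msgs):
    """Naive per-token spam probability from per-message frequencies."""
    spam_counts = Counter(t for m in spam_msgs for t in tokens(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokens(m))
    probs = {}
    for t in spam_counts.keys() | ham_counts.keys():
        s = spam_counts[t] / len(spam_msgs)
        h = ham_counts[t] / len(ham_msgs)
        probs[t] = s / (s + h)
    return probs


def extreme_only(probs):
    """Step 2: keep only tokens that are 100% spam or 100% ham."""
    return {t: p for t, p in probs.items() if p in (0.0, 1.0)}


# Made-up corpora for illustration only.
spam = ["Meet horny Russian brides online!", "Russian brides want to meet you"]
ham = ["I read an article about Russian brides in a magazine"]

probs = token_probs(spam, ham)
kept = extreme_only(probs)
print(probs["russian brides"], "russian brides" in kept)
print(kept["meet horny"], kept["in a magazine"])
```

The pruned table keeps only tokens seen on one side, which is the "not matched" set from the original post; the discarded mid-range tokens are the information that Idea (2) throws away.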

Re: My new method for blocking spam - example

Posted by Marc Perkel <su...@junkemailfilter.com>.

On 01/20/16 11:50, Reindl Harald wrote:
>
>
> On 20.01.2016 at 20:46, Marc Perkel wrote:
>> Let me give you an example. Here's 2 subject lines. Easy to guess which
>> one is spam.
>>
>> "Meet horny Russian Brides online!"
>> "I read an article about Russian brides in a magazine."
>>
>> Bayes or SpamAssassin would look at "Russian Brides" and see that 499 out
>> of 500 times it's spam. Therefore the nonspam version scores spam points.
>>
>> In my system "Russian brides" is neutral because it is used in both spam
>> and ham. But on the spam side, phrases used in other spam *not matched*
>> in ham
>
> that is *exactly* how bayes works and subject alone is *not* the key
>
> tokenizing the *whole* message with enough spam *and* ham samples is 
> the key - so there are two options:
>
> * you reinvented bayes with a different name
> * you modified bayes with some tricks and hope
>   spammers would not adopt them
>
> anyways, i doubt there is a sane reason for a patent because the 
> principles are just prior art -> bayes
>
>

Again - Bayes compares what matches. My filter compares what doesn't match.


-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: My new method for blocking spam - example

Posted by jdow <jd...@earthlink.net>.

On 2016-01-20 13:26, Matt Garretson wrote:
> I am not an expert but it does seem like the main novel thing is how
> (and how many) multi-word tokens are generated.  I have been using
> multi-word tokens with bogofilter for years and it does help.  Of course
> bogofilter only uses adjacent words -- perhaps OP's way of combining
> words could yield an increase in accuracy, at the expense of processing
> time.
>
> The stuff about not-matching rather than matching seems like nonsense.
>
> Not to sound mean, but this is not the first time OP has come out with
> the latest greatest revolution in spam blocking.  :)  I admire his
> dedication, in any case!
>

Matt, it's amazing how many times this particular person has come up with the 
greatest secret sauce.... This reads like deja vu all over again to me.

{^_^}

Re: My new method for blocking spam - example

Posted by RW <rw...@googlemail.com>.

On Wed, 20 Jan 2016 16:26:19 -0500
Matt Garretson wrote:

> I am not an expert but it does seem like the main novel thing is how
> (and how many) multi-word tokens are generated.  I have been using
> multi-word tokens with bogofilter for years and it does help.  Of
> course bogofilter only uses adjacent words -- perhaps OP's way of
> combining words could yield an increase in accuracy, at the expense
> of processing time.

It's exactly the same as bogofilter.

Re: My new method for blocking spam - example

Posted by Matt Garretson <ma...@assembly.state.ny.us>.

I am not an expert but it does seem like the main novel thing is how
(and how many) multi-word tokens are generated.  I have been using
multi-word tokens with bogofilter for years and it does help.  Of course
bogofilter only uses adjacent words -- perhaps OP's way of combining
words could yield an increase in accuracy, at the expense of processing
time.

The stuff about not-matching rather than matching seems like nonsense.

Not to sound mean, but this is not the first time OP has come out with
the latest greatest revolution in spam blocking.  :)  I admire his
dedication, in any case!

Re: My new method for blocking spam - example

Posted by Reindl Harald <h....@thelounge.net>.


On 20.01.2016 at 20:46, Marc Perkel wrote:
> Let me give you an example. Here's 2 subject lines. Easy to guess which
> one is spam.
>
> "Meet horny Russian Brides online!"
> "I read an article about Russian brides in a magazine."
>
> Bayes or SpamAssassin would look at "Russian Brides" and see that 499 out
> of 500 times it's spam. Therefore the nonspam version scores spam points.
>
> In my system "Russian brides" is neutral because it is used in both spam
> and ham. But on the spam side, phrases used in other spam *not matched*
> in ham

that is *exactly* how bayes works and subject alone is *not* the key

tokenizing the *whole* message with enough spam *and* ham samples is the 
key - so there are two options:

* you reinvented bayes with a different name
* you modified bayes with some tricks and hope
   spammers would not adopt them

anyways, i doubt there is a sane reason for a patent because the 
principles are just prior art -> bayes

Re: My new method for blocking spam - example

Posted by RW <rw...@googlemail.com>.

On Wed, 20 Jan 2016 11:46:39 -0800
Marc Perkel wrote:

> Let me give you an example. Here's 2 subject lines. Easy to guess
> which one is spam.
> 
> "Meet horny Russian Brides online!"
> "I read an article about Russian brides in a magazine."
> 
> Bayes or SpamAssassin would look at "Russian Brides" and see that 499 out
> of 500 times it's spam. Therefore the nonspam version scores spam points.

Not if you modify the Robinson parameters and the cut-off to exclude
such tokens. Then only the tokens your system would use would make it
through to the final cut. 

> My filter gets both right because of NOT matching. Not matching
> is a comparison to an infinite set.

It's not an infinite set unless you assume that phrases never
seen before are spammy or hammy.
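
The effect of the Robinson parameters and the cut-off can be sketched as below. The smoothing formula f(w) = (s*x + n*p(w)) / (s + n) is Robinson's as it is commonly stated; everything else is an assumption for illustration: the function names, the parameter names `s`, `x`, and `min_dev`, the chosen values, and the counts are hypothetical, and both corpora are assumed to be non-empty.

```python
def robinson_f(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed spam probability of one token.

    s: strength given to the prior, x: probability assumed for a token
    with no history, n: number of training messages containing the token.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: falls back to the neutral prior
    spam_rate = spam_count / n_spam
    ham_rate = ham_count / n_ham
    p = spam_rate / (spam_rate + ham_rate)
    return (s * x + n * p) / (s + n)


def is_used(f, min_dev=0.1):
    """Cut-off: ignore tokens whose score is too close to neutral (0.5)."""
    return abs(f - 0.5) >= min_dev


# A tiny s and a very strict cut-off leave only one-sided tokens in play.
mixed = robinson_f(499, 1, 1000, 1000, s=0.001)      # seen in spam and ham
one_sided = robinson_f(10, 0, 1000, 1000, s=0.001)   # seen only in spam
unseen = robinson_f(0, 0, 1000, 1000, s=0.001)       # never seen

for name, f in [("mixed", mixed), ("one-sided", one_sided), ("unseen", unseen)]:
    print(name, round(f, 5), is_used(f, min_dev=0.499))
```

With these settings the token seen 499 times in spam and once in ham is dropped, the one-sided token survives, and a never-seen token scores exactly `x`, so it is neither spammy nor hammy unless `x` is moved away from 0.5.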