
Posted to users@spamassassin.apache.org by Marc Perkel <su...@junkemailfilter.com> on 2016/08/21 16:47:45 UTC

Matching infinite sets

Actually - you can match an infinite set. And maybe this is what's
hard for some people to wrap their heads around.

Suppose set A contains 2 items, apples and oranges.
So we define set B as everything in the universe that is not in set A.
So set B is an infinite set, everything in the universe EXCEPT apples 
and oranges.

Our first test set contains an orange - so it matches set A and not set B.
Our second test set contains a cherry - so it doesn't match set A but it 
does match set B.
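
As a minimal sketch of the idea in Python, assuming the items are plain strings: set B never has to be stored, because membership in B is just non-membership in the finite set A. The class name `ComplementSet` is made up for illustration and is not taken from any real filter.

```python
class ComplementSet:
    """Represents 'everything not in `excluded`' without enumerating it."""

    def __init__(self, excluded):
        self.excluded = frozenset(excluded)

    def __contains__(self, item):
        # Membership in the infinite set is decided by a finite lookup.
        return item not in self.excluded


set_a = {"apple", "orange"}
set_b = ComplementSet(set_a)

print("orange" in set_a, "orange" in set_b)  # True False
print("cherry" in set_a, "cherry" in set_b)  # False True
```

The only operation this supports is a membership test, which is all the example above needs.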

When you have a method that matches against infinite sets, it completely
changes how you think about spam and ham detection.

On 08/16/16 12:57, Shawn Bakhtiar wrote:
>
> By the way, you can't match an infinite set (well theoretically but
> not actually).
> https://en.wikipedia.org/wiki/Intersection_(set_theory)
>

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Sunday 21 August 2016 at 21:22:38, Damian wrote:

> Am 21.08.2016 um 18:47 schrieb Marc Perkel:
> > Actually - you can match an infinite set. And maybe this is what's
> > hard for some people to wrap their heads around.
> > 
> > Suppose set A contains 2 items, apples and oranges.
> > So we define set B as everything in the universe that is not in set A.
> > So set B is an infinite set, everything in the universe EXCEPT apples
> > and oranges.
> 
> There is no such set B, as it would contain itself.

In that case try the definition: "B contains all possible email tokens which 
are not in set A", thus excluding sets themselves from being members of B.


Antony.

-- 
This sentence contains exacly three erors.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Martin Gregorie <ma...@gregorie.org>.

On Sun, 2016-08-21 at 16:56 -0400, Dianne Skoll wrote:
> On Sun, 21 Aug 2016 21:22:38 +0200
> Damian <sp...@arcsin.de> wrote:
> 
> > 
> > > 
> > > So we define set B as everything in the universe that is not in
> > > set
> > > A. So set B is an infinite set, everything in the universe EXCEPT
> > > apples and oranges.
> > 
> > There is no such set B, as it would contain itself.
> And... why can't a set contain itself?
> 
Because recursive sets are off topic.

At least, I assume that if Marc had meant to include recursion he would
have said so.


Martin

Re: Matching infinite sets

Posted by Joe Quinn <jq...@pccc.com>.

On 8/21/2016 5:55 PM, Sidney Markowitz wrote:
> Dianne Skoll wrote on 22/08/16 8:56 AM:
>> And... why can't a set contain itself?
>>
> It can't in standard modern set theory (ZFC), through the axiom of foundation,
> also known as the axiom of regularity
>    https://en.wikipedia.org/wiki/Axiom_of_regularity
> which is a formulation that allows set theory to avoid Russell's Paradox.
> (see also https://en.wikipedia.org/wiki/ZFC)
>
> Just like Euclidean Geometry has the axiom that parallel lines never meet, and
> you get various non-euclidean geometries by changing that axiom, there are
> non-standard set theories that do not include the axiom of regularity, in
> which there can be sets that include themselves.
>
> None of that is relevant to the discussion of Marc Perkel's ideas because he
> is talking about sets of tokens from email (or sets of potential tokens?) not
> sets that contain sets. And all he needs to do with his infinite sets is be
> able to test if a token is in it, which is easy to do since the set is defined
> as the complement of a finite set. (I'm not saying this to agree with the
> method as good or to argue against it. I'm one of those people he mentions who
> understands how Bayesian spam filtering works who has yet to wrap my head
> around what he is presenting - For now I'm staying agnostic about it until I
> do understand it better).
>
>   Sidney
This is a good summary. As a fun theoretical side-note, ZFC can be 
interpreted as a type theory and then used as a way to reason about the 
behavior of programs. One of its major weaknesses is that it's possible 
to formulate exactly this sort of issue where a set can contain other 
sets of unknown depth. This corresponds to untyped programming languages 
and is almost always resolved by formalizations that correspond to 
adding a type system (as your last paragraph does).

But back to discussing Bayes... ;)

Re: Matching infinite sets

Posted by RW <rw...@googlemail.com>.

On Mon, 22 Aug 2016 09:55:10 +1200
Sidney Markowitz wrote:

>  I'm one of those people he mentions who understands
> how Bayesian spam filtering works who has yet to wrap my head around
> what he is presenting - For now I'm staying agnostic about it until I
> do understand it better).

What it amounts to is:

Training: 

- tokenize a corpus of spam and ham 
- compile a list of tokens that occur only in spam and a list of
  tokens that only occur in ham

Classification:

- Tokenize the email
- count how many of the tokens are in each of the two lists
- compare the two counts
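
The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace splitting stands in for whatever tokenizer is really used, distinct tokens are counted rather than occurrences, and the function names `train` and `classify` are hypothetical.

```python
def train(spam_corpus, ham_corpus, tokenize=str.split):
    """Return (spam_only, ham_only) token sets from two lists of messages."""
    spam_tokens = set()
    for message in spam_corpus:
        spam_tokens.update(tokenize(message))
    ham_tokens = set()
    for message in ham_corpus:
        ham_tokens.update(tokenize(message))
    # Tokens seen on both sides carry no signal in this scheme, so drop them.
    return spam_tokens - ham_tokens, ham_tokens - spam_tokens


def classify(message, spam_only, ham_only, tokenize=str.split):
    """Compare how many distinct tokens fall in each one-sided list."""
    tokens = set(tokenize(message))
    spam_hits = len(tokens & spam_only)
    ham_hits = len(tokens & ham_only)
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return "undecided"


spam_only, ham_only = train(
    ["cheap pills online", "cheap watches online"],
    ["lunch meeting online", "lung biopsy results"],
)
print(classify("cheap pills for lunch", spam_only, ham_only))  # spam
```

Here "online" appears in both corpora and is discarded, so the sample message scores two spam-only tokens against one ham-only token.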

In Bayes, if you set Robinson's S parameter to 0, then tokens that only
occur in spam or ham get a token probability of exactly 1 and 0
respectively. 

Tokens that have been seen in both spam and ham get a probability
between 0 and 1. So if you then set MIN_PROB_STRENGTH to 0.5 you can
discard all of these. 

All of the remaining tokens have probabilities of 0 or 1 so running
them through the chi-squared calculation (or any sensible symmetric
combining algorithm) and then comparing the result to 0.5  gives the
same result as comparing the number of spam-only and ham-only tokens.
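
A rough Python sketch of the two settings just described, not SpamAssassin's actual code. It assumes Robinson's estimate f(w) = (s*x + n*p(w)) / (s + n) and assumes a token is kept when its distance from 0.5 is at least the threshold; the names `token_probability` and `is_kept` and the sample counts are invented for illustration.

```python
def token_probability(spam_count, ham_count, n_spam, n_ham, s=0.0, x=0.5):
    """Robinson-style estimate f(w) = (s*x + n*p) / (s + n)."""
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the prior x
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    p = spam_ratio / (spam_ratio + ham_ratio)
    return (s * x + n * p) / (s + n)


MIN_PROB_STRENGTH = 0.5


def is_kept(prob, min_strength=MIN_PROB_STRENGTH):
    # Assumption: a token is used only if it is at least this far from 0.5.
    return abs(prob - 0.5) >= min_strength


# (spam messages containing the token, ham messages containing it)
counts = {
    "viagra": (30, 2),
    "cheapest viagra online": (12, 0),
    "of the lung": (0, 5),
}
probs = {tok: token_probability(s_n, h_n, 100, 100) for tok, (s_n, h_n) in counts.items()}
kept = {tok: p for tok, p in probs.items() if is_kept(p)}
print(kept)  # {'cheapest viagra online': 1.0, 'of the lung': 0.0}
```

With s = 0 the estimate collapses to the raw ratio, so one-sided tokens land on exactly 0 or 1 and everything else is filtered out by the 0.5 threshold.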

In short it's mathematically equivalent to Bayes with different
tokenization and different constants; and on the face of it
the values of S and MIN_PROB_STRENGTH are very sub-optimal. 

OTOH it wouldn't surprise me if the tokenization is much better.

Re: Matching infinite sets

Posted by Sidney Markowitz <si...@sidney.com>.

Dianne Skoll wrote on 22/08/16 8:56 AM:
> And... why can't a set contain itself?
> 

It can't in standard modern set theory (ZFC), through the axiom of foundation,
also known as the axiom of regularity
  https://en.wikipedia.org/wiki/Axiom_of_regularity
which is a formulation that allows set theory to avoid Russell's Paradox.
(see also https://en.wikipedia.org/wiki/ZFC)

Just like Euclidean Geometry has the axiom that parallel lines never meet, and
you get various non-euclidean geometries by changing that axiom, there are
non-standard set theories that do not include the axiom of regularity, in
which there can be sets that include themselves.

None of that is relevant to the discussion of Marc Perkel's ideas because he
is talking about sets of tokens from email (or sets of potential tokens?) not
sets that contain sets. And all he needs to do with his infinite sets is be
able to test if a token is in it, which is easy to do since the set is defined
as the complement of a finite set. (I'm not saying this to agree with the
method as good or to argue against it. I'm one of those people he mentions who
understands how Bayesian spam filtering works who has yet to wrap my head
around what he is presenting - For now I'm staying agnostic about it until I
do understand it better).

 Sidney

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Sun, 21 Aug 2016 21:22:38 +0200
Damian <sp...@arcsin.de> wrote:

> > So we define set B as everything in the universe that is not in set
> > A. So set B is an infinite set, everything in the universe EXCEPT
> > apples and oranges.

> There is no such set B, as it would contain itself.

And... why can't a set contain itself?

Regards,

Dianne.

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 15:04:49, Marc Perkel wrote:

> I'm confused by the confusion here.
> 
> Set A - a finite set - has some members,
> Set B - an infinite set - is everything that is NOT in Set A
> 
> So you match a test item to Set A and if it matches it's a member of A.
> If it doesn't match Set A it's a member of B.
> 
> How is this not really simple?

Because "everything that is NOT in Set A" means some surprisingly complicated 
things to some people, which I believe are irrelevant for the purposes of 
your spam identifier.

It might keep the pedants happier if you were to identify the sets as:

Set A contains some email tokens.

Set B contains all possible email tokens which are not in Set A.

This then precludes the possibility that Set B might contain itself, since a 
set is not a plausible email token.

Antony.

-- 
I just got a new mobile phone, and I called it Titanic.  It's already syncing.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Christian Grunfeld <ch...@gmail.com>.

What you are trying to do is to identify a source of messages by its
entropy... supposing the entropy of a ham source is distinguishable from
that of a spam one...

2016-08-22 13:48 GMT-03:00 Antony Stone <Antony.Stone@spamassassin.open.source.it>:

> On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:
>
> > On 08/22/16 07:37, Antony Stone wrote:
> > >
> > > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > > "online" are not tokens?
> >
> > They would all be tokens. Just pointing out one that would match spam
> > and not match ham. "cheapest" and "online" would likely be in both sets
> > and would be ignored.
>
> Hm, that doesn't tie up with your earlier reply:
>
> On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:
>
> > On 08/22/16 07:28, Dianne Skoll wrote:
> > > On Mon, 22 Aug 2016 07:16:41 -0700
> > >
> > > As far as I understand your algorithm, if an email contains at least one
> > > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > > as ham.  And conversely, if it contains at least one spam token but zero
> > > ham tokens, you classify it as spam.
> >
> > YES! YES! YES!
>
> Er, really?  See below.
>
> > Although I look at some thousand "fingerprints" to get a more
> > significant result.
> >
> > > The other two possibilities (no tokens in either or some tokens in both)
> > > are undecidable.
> >
> > Exactly!
>
> So, it's not that "if an email contains at least one token in the 'ham' set
> and zero tokens in the 'spam' set, you classify it as ham".
>
> You in fact ignore any tokens in the email which are in both the 'ham' and
> 'spam' sets, and then - what - work out which set contains more of the
> left-over tokens?
>
>
> Antony.
>
> --
> Pavlov is in the pub enjoying a pint.
> The barman rings for last orders, and Pavlov jumps up exclaiming "Damn!  I
> forgot to feed the dog!"
>
>                                                    Please reply to the list;
>                                                          please *don't* CC me.
>

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:

> On 08/22/16 07:37, Antony Stone wrote:
> > 
> > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > "online" are not tokens?
>
> They would all be tokens. Just pointing out one that would match spam
> and not match ham. "cheapest" and "online" would likely be in both sets
> and would be ignored.

Hm, that doesn't tie up with your earlier reply:

On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:
> > On Mon, 22 Aug 2016 07:16:41 -0700
> > 
> > As far as I understand your algorithm, if an email contains at least one
> > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > as ham.  And conversely, if it contains at least one spam token but zero
> > ham tokens, you classify it as spam.
> 
> YES! YES! YES!

Er, really?  See below.

> Although I look at some thousand "fingerprints" to get a more
> significant result.
> 
> > The other two possibilities (no tokens in either or some tokens in both)
> > are undecidable.
> 
> Exactly!

So, it's not that "if an email contains at least one token in the 'ham' set 
and zero tokens in the 'spam' set, you classify it as ham".

You in fact ignore any tokens in the email which are in both the 'ham' and 
'spam' sets, and then - what - work out which set contains more of the
left-over tokens?
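
The two readings can be put side by side in a short Python sketch. The thread does not settle which one Marc actually uses; the function names are hypothetical, and `spam_hits` and `ham_hits` are assumed to be counts of an email's tokens found in the spam-only and ham-only lists.

```python
def strict_rule(spam_hits, ham_hits):
    """First reading: decide only when exactly one side has any hits."""
    if ham_hits > 0 and spam_hits == 0:
        return "ham"
    if spam_hits > 0 and ham_hits == 0:
        return "spam"
    return "undecided"


def majority_rule(spam_hits, ham_hits):
    """Second reading: whichever one-sided list matches more tokens wins."""
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return "undecided"


# An email with 40 spam-only tokens and 1 ham-only token:
print(strict_rule(40, 1))    # undecided
print(majority_rule(40, 1))  # spam
```

The two rules agree whenever one count is zero and differ as soon as an email has hits on both sides.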


Antony.

-- 
Pavlov is in the pub enjoying a pint.
The barman rings for last orders, and Pavlov jumps up exclaiming "Damn!  I 
forgot to feed the dog!"

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.


On 08/22/16 07:37, Antony Stone wrote:
> On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:
>
>> OK - Trying to make this really simple. Just talking about concept now.
>>
>> Let's say I get an email where the subject is "I have adenocarcinoma of
>> the lung".
>>
>> Right off you know it's ham because spammers never use the word
>> "adenocarcinoma" and normal people do. Spammers also never use:
>>
>> "of the lung"
>> "the lung"
>> "adenocarcinoma of"
> How do you create those boundaries to define the tokens?

Here's an example:

"the quick brown fox jumps over the lazy dog"

becomes ...

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown fox" "quick brown fox"
"the quick brown fox" "jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps" "over" "jumps over"
"fox jumps over" "brown fox jumps over" "the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy dog" "over the lazy dog"
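
The list above follows a regular pattern: for each word, emit every run of one to four consecutive words that ends at that word. A minimal Python sketch that reproduces it, assuming whitespace splitting and a four-word limit (the function name `phrase_tokens` is made up here):

```python
def phrase_tokens(text, max_words=4):
    """Emit every run of 1..max_words consecutive words, grouped by last word."""
    words = text.split()
    tokens = []
    for end in range(1, len(words) + 1):
        for size in range(1, min(max_words, end) + 1):
            tokens.append(" ".join(words[end - size:end]))
    return tokens


tokens = phrase_tokens("the quick brown fox jumps over the lazy dog")
print(len(tokens))  # 30
print(tokens[:6])   # ['the', 'quick', 'the quick', 'brown', 'quick brown', 'the quick brown']
```

Note that "the" is emitted twice, as in the list above; a real filter would presumably store tokens in a set or a counter.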






>
>> ....
>>
>> So - tell me you follow this so far. Spammers don't spam about
>> adenocarcinoma.
>>
>> In this case I'm identifying ham because in some previous email people
>> were talking about lung cancer and those phrases were learned as ham.
>> But what makes it really ham is not just that it matches previous ham,
>> but it doesn't match previous spam.
>>
>> A word like Viagra for example would produce no score because it is in
>> both sets. However "cheapest viagra online" would match spam and not
>> match ham indicating it's spam.
> So what makes "cheapest Viagra online" a token, such that "cheapest" and
> "online" are not tokens?
>
>

They would all be tokens. Just pointing out one that would match spam 
and not match ham. "cheapest" and "online" would likely be in both sets 
and would be ignored.


Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:

> OK - Trying to make this really simple. Just talking about concept now.
> 
> Let's say I get an email where the subject is "I have adenocarcinoma of
> the lung".
> 
> Right off you know it's ham because spammers never use the word
> "adenocarcinoma" and normal people do. Spammers also never use:
> 
> "of the lung"
> "the lung"
> "adenocarcinoma of"

How do you create those boundaries to define the tokens?

> ....
> 
> So - tell me you follow this so far. Spammers don't spam about
> adenocarcinoma.
> 
> In this case I'm identifying ham because in some previous email people
> were talking about lung cancer and those phrases were learned as ham.
> But what makes it really ham is not just that it matches previous ham,
> but it doesn't match previous spam.
> 
> A word like Viagra for example would produce no score because it is in
> both sets. However "cheapest viagra online" would match spam and not
> match ham indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and 
"online" are not tokens?


Antony.

-- 
The words "e pluribus unum" on the Great Seal of the United States are from a 
poem by Virgil entitled "Moretum", which is about cheese and garlic salad 
dressing.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.

OK - Trying to make this really simple. Just talking about concept now.

Let's say I get an email where the subject is "I have adenocarcinoma of 
the lung".

Right off you know it's ham because spammers never use the word 
"adenocarcinoma" and normal people do. Spammers also never use:

"of the lung"
"the lung"
"adenocarcinoma of"
....

So - tell me you follow this so far. Spammers don't spam about 
adenocarcinoma.

In this case I'm identifying ham because in some previous email people 
were talking about lung cancer and those phrases were learned as ham. 
But what makes it really ham is not just that it matches previous ham, 
but it doesn't match previous spam.

A word like Viagra for example would produce no score because it is in 
both sets. However "cheapest viagra online" would match spam and not 
match ham indicating it's spam.

The magic here is that this detects both spam and ham. And it is 
especially good at detecting ham, which greatly reduces false positives.

Re: Matching infinite sets

Posted by Shawn Bakhtiar <sh...@hotmail.com>.

On Aug 22, 2016, at 10:44 AM, Marc Perkel <su...@junkemailfilter.com> wrote:

> On 08/22/16 09:06, Dianne Skoll wrote:
> > On Mon, 22 Aug 2016 09:03:38 -0700
> > Marc Perkel <su...@junkemailfilter.com> wrote:
> >
> > > The ones that are the same are of no interest. Only where it matches
> > > one side and not the other.
> > But... but... that's exactly like Bayes if you throw out tokens whose
> > observed probability is not 0 or 1.
> >
> > Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
> > and that's why you get good results. Multiword Bayes is just as good,
> > and I know that from experience.
>
> This is nothing like bayes. Bayes is creating a mental block. When I
> describe it to people who don't know bayes they immediately get it. If I
> describe it to people who know bayes - they confuse it. Bayes is a
> probability spectrum based on a frequency match on both sets. That's not
> even close to what I'm doing.

I think you've copied and pasted this same paragraph half a dozen times now,
and the list has tried its best to accommodate your statement about "Bayes
is creating a mental block", asking you pertinent questions that either
remained unanswered or, when answered, provided conflicting statements, and
when pressed ended up showing that what you are doing is (at best) a
slightly modified version.

However, I find the statement "When I describe it to people who don't know
bayes they immediately get it" the most telling of them all. Of course
people who don't know the probability theory will look at what you are doing
and go "Wow!!! This is great!!" BECAUSE THEY DON'T KNOW.

People who know, obviously, recognize it for what it is, and you can claim
as much as you like it's NOT, but at the end of the day, if it looks like a
rose, smells like a rose (no matter what you call it) 'tis still a rose!

All you have to do is READ the Process section of the following link to see
exactly how similar your explanation is (save one factor which is using
phrases vs. words), which has already been explained as a feature of SA
using multi-word tokens:
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering

> Also - some of what I'm doing is all combinations, not just sequential.
> So it's like a system that writes and scores its own rules. I just throw
> data at it and it classifies it.
>
> The real magic is the feedback learning. So as it identifies ham it learns
> new words and phrases that then match email from other people. So it
> learns how normal people speak, it learns how spammers speak, and it
> identifies the DIFFERENCES between the two. And it's completely automated.
>
> --
> Marc Perkel - Sales/Support
> support@junkemailfilter.com
> http://www.junkemailfilter.com
> Junk Email Filter dot com
> 415-992-3400

Re: Matching infinite sets

Posted by John Hardin <jh...@impsec.org>.

On Mon, 22 Aug 2016, Matus UHLAR - fantomas wrote:

>> > On Mon, 22 Aug 2016 09:03:38 -0700
>> > Marc Perkel <su...@junkemailfilter.com> wrote:
>
>> The real magic is the feedback learning. So as it identifies ham it learns 
>> new words and phrases that then match email from other people. So it learns 
>> how normal people speak, it learns how spammers speak, and it identifies 
>> the DIFFERENCES between the two. And it's completely automated.
>
> This is just the same as SA Bayes with autolearning. However it will suffer
> the same issues and thus will require learning by other sources, either
> manual or other SA rules.

The restriction to probabilities 0 or 1 may mitigate the 
robot-off-the-rails syndrome to a degree.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Politicians never accuse you of "greed" for wanting other people's
   money, only for wanting to keep your own money.    -- Joseph Sobran
-----------------------------------------------------------------------
  2 days until the 1937th anniversary of the destruction of Pompeii

Re: Matching infinite sets

Posted by Ted Mittelstaedt <te...@ipinc.net>.

On 8/22/2016 11:40 AM, Matus UHLAR - fantomas wrote:
>>> On Mon, 22 Aug 2016 09:03:38 -0700
>>> Marc Perkel <su...@junkemailfilter.com> wrote:
>>>> The ones that are the same are of no interest. Only where it matches
>>>> one side and not the other.
>
>> On 08/22/16 09:06, Dianne Skoll wrote:
>>> But... but... that's exactly like Bayes if you throw out tokens whose
>>> observed probability is not 0 or 1.
>>>
>>> Also, in your list of tokens, they are all phrases ranging from 1 to
>>> 4 words,
>>> and that's why you get good results. Multiword Bayes is just as good,
>>> and I know that from experience.
>
> On 22.08.16 10:44, Marc Perkel wrote:
>> This is nothing like bayes. Bayes is creating a mental block.
>
> This is just like bayes.
> There are (only) a few differences between what you describe and bayes as
> implemented in SA, but it's still bayes-based.
>
>> When I describe it to people who don't know bayes they immediately get
>> it. If I describe it to people who know bayes - they confuse it. Bayes
>> is a probability spectrum based on a frequency match on both sets.
>> That's not even close to what I'm doing.
>
> Bayes uses probabilities between 0 and 1, while you only accept 0 and 1.
> You have just tweaked bayes, and I'm not even sure if towards better
> detection (I believe, towards worse).
>
>> Also - some of what I'm doing is all combinations, not just
>> sequential. So it's like a system that writes and scores its own
>> rules. I just throw data at it and it classifies it.
>
> The main difference from bayes as implemented in SA is that you make
> multiword tokens. This is good, but you aren't even the first one who proposed
> or did that. The second main difference is in the point above.
>
>> The real magic is the feedback learning. So as it identifies ham it
>> learns new words and phrases that then match email from other people.
>> So it learns how normal people speak, it learns how spammers speak,
>> and it identifies the DIFFERENCES between the two. And it's completely
>> automated.
>
> This is just the same as SA Bayes with autolearning. However it will suffer
> the same issues and thus will require learning by other sources, either
> manual or other SA rules.
>

You see, Marc, this has circled around to exactly what I said last week.

The problem I have always had with SA and the Bayes learner is that for 
it to work, it requires sources.   In SA it requires a source of spam to 
build tokens and (I guess) requires a source of ham to remove them.  In 
your system it requires a source of ham to build tokens and (I guess)
requires a source of spam to remove them.

But the fundamental problem with all of these is in getting the sources.

Getting spam is simple.   I merely review my email logs looking for 
spammers sending to non-existent e-mail addresses that have NEVER been 
on my server.  When I see a lot of the same attempts I then create a
honeypot email address using that.   Within a couple months I have
some of the highest quality spam available as spammers communicate the
"discovered" email address to each other.   All automatically done.

But, getting ham is HARD.   You have to convince users to give it to 
you.  And you cannot really trust users to do it without contaminating
their ham stream with spam they were too lazy to delete.   So I end up 
wasting a lot of time cleaning the ham before inputting it into SA.

This is why I have said before - and I will repeat it again - that if 
you have found a good way to convince your users to offer up cleaned
ham in an automatic fashion, that would be revolutionary.

It is NOT the back end that matters!!!!!!   That is easy.   I can hire
some programmers and math majors who have doctorates in set theory to
build that part of it, and they can probably do it in an afternoon and
then go out for pizza.

It is the front end that is hard!!!!   And it's particularly hard when 
your interface is either IMAP or POP3.   Providing a web interface that 
forces users to sort ham is somewhat easier but not all users want a
web interface.   I personally don't use one myself, so why would I expect my
users to do it?

You have repeatedly put down whatever user interface you have built by
referring to it as crude programming and you don't want to show it. 
But what you don't seem to get is that every scrap of user interface 
code out there is some of the crudest ugliest most icky and disgusting
code out there.

Users are people and people DO NOT logically interact with computers. 
They use a combination of sort-of-logic, rubbish they learned from some
other interface, and God-knows-what else to operate software interfaces.
So you can design the most elegant and cleanest interface in the world
with the most elegant code behind it and release it to the world and
God-help-you within 5 years that interface code will be so fugly that
you can only force newbie greenhorn programmers who have no experience 
but are so desperate to work for you that they will do anything, to work
on it.  And eventually not even then, so you scrap it and release 
Windows 8 and start the cycle all over again.  ( If you think the 
Windows 10 user interface code is less ugly than 8 I have a bridge to 
sell you)

You should not be embarrassed about your ugly user interface code.   You
should be proud of the fact that you got it to work at all.  There's
plenty of commercial user interfaces that don't work at all (windows 8)

But, you don't want to show us your fugly user interface code that 
produces clean ham.   You just want to show us your elegant back-end 
code that digests clean ham.  Well, I already have a back end that eats
clean ham - maybe it don't work as good as yours - but if I replace my
back end with yours - I still have the same problem as before, I'm still
trying to find clean ham!!

So, congratulations Marc!   You are now no different than any other 
programmer out there!  You are an actual programmer now and have passed 
the litmus test because you just want to give us code we can't use and 
not the code we need! <eyeroll>

Ted


Re: Matching infinite sets

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

>>On Mon, 22 Aug 2016 09:03:38 -0700
>>Marc Perkel <su...@junkemailfilter.com> wrote:
>>>The ones that are the same are of no interest. Only where it matches
>>>one side and not the other.

>On 08/22/16 09:06, Dianne Skoll wrote:
>>But... but... that's exactly like Bayes if you throw out tokens whose
>>observed probability is not 0 or 1.
>>
>>Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
>>and that's why you get good results.  Multiword Bayes is just as good,
>>and I know that from experience.

On 22.08.16 10:44, Marc Perkel wrote:
>This is nothing like bayes. Bayes is creating a mental block.

This is just like bayes.
There are (only) a few differences between what you describe and bayes as
implemented in SA, but it's still bayes-based.

> When I 
>describe it to people who don't know bayes they immediately get it. 
>If I describe it to people who know bayes - they confuse it. Bayes is 
>a probability spectrum based on a frequency match on both sets. 
>That's not even close to what I'm doing.

Bayes uses probabilities between 0 and 1, while you only accept 0 and 1. 

You have just tweaked bayes, and I'm not even sure if towards better
detection (I believe, towards worse).

>Also - some of what I'm doing is all combinations, not just 
>sequential. So it's like a system that writes and scores its own 
>rules. I just throw data at it and it classifies it.

The main difference from bayes as implemented in SA is that you make
multiword tokens.  This is good, but you aren't even the first one who proposed
or did that.  The second main difference is in the point above.

>The real magic is the feedback learning. So as it identifies ham it 
>learns new words and phrases that then match email from other people. 
>So it learns how normal people speak, it learns how spammers speak, 
>and it identifies the DIFFERENCES between the two. And it's 
>completely automated.

This is just the same as SA Bayes with autolearning. However it will suffer
the same issues and thus will require learning by other sources, either
manual or other SA rules.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The 3 biggets disasters: Hiroshima 45, Tschernobyl 86, Windows 95

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 10:44:42 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> This is nothing like bayes.

It's exactly like Bayes.  You're stumbling across a hacked version of
Bayes.  You seem to lack the mathematical background to see what you're
doing, thinking it's somehow fundamentally different.  But it's not.

> The real magic is the feedback learning.

Which is how Bayes works.

> So as it identifies ham it learns new words and phrases that then
> match email from other people.

Which is what Bayes does.

> So it learns how normal people speak, it learns how spammers speak,
> and it identifies the DIFFERENCES between the two. And it's
> completely automated.

You've just described Bayes.  Paul Graham used almost that exact language
14 years ago in his classic paper, http://www.paulgraham.com/spam.html
Check out this paragraph:

    I'm more hopeful about Bayesian filters, because they evolve with the
    spam. So as spammers start using "c0ck" instead of "cock" to evade
    simple-minded spam filters based on individual words, Bayesian filters
    automatically notice. Indeed, "c0ck" is far more damning evidence than
    "cock", and Bayesian filters know precisely how much more.

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.

On 08/22/16 09:06, Dianne Skoll wrote:
> On Mon, 22 Aug 2016 09:03:38 -0700
> Marc Perkel <su...@junkemailfilter.com> wrote:
>
>> The ones that are the same are of no interest. Only where it matches
>> one side and not the other.
> But... but... that's exactly like Bayes if you throw out tokens whose
> observed probability is not 0 or 1.
>
> Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
> and that's why you get good results.  Multiword Bayes is just as good,
> and I know that from experience.
>
>

This is nothing like bayes. Bayes is creating a mental block. When I 
describe it to people who don't know bayes they immediately get it. If I 
describe it to people who know bayes - they confuse it. Bayes is a 
probability spectrum based on a frequency match on both sets. That's not 
even close to what I'm doing.

Also - some of what I'm doing is all combinations, not just sequential. 
So it's like a system that writes and scores its own rules. I just 
throw data at it and it classifies it.

The real magic is the feedback learning. So as it identifies ham it 
learns new words and phrases that then match email from other people. So 
it learns how normal people speak, it learns how spammers speak, and it 
identifies the DIFFERENCES between the two. And it's completely automated.

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> The ones that are the same are of no interest. Only where it matches
> one side and not the other.

But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
and that's why you get good results.  Multiword Bayes is just as good,
and I know that from experience.

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.


On 08/22/16 07:40, Antony Stone wrote:
> On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:
>
>> On 08/22/16 07:28, Dianne Skoll wrote:
>>
>>> What percentage of emails using your algorithm are actually
>>> decidable?
>> Almost 100% if you look at a wide variety of tokens from multiple
>> attributes. Subject, body, content flags, header structure, combinations
>> of all domains reference, php scripts, name part of from addresses,
>> behavior flags.
> I would have said that a very large number of the words used in spam mails are
> the same as the words used in ham mails, so I suspect I'm confused about what
> constitutes a "token".

The ones that are the same are of no interest. Only where it matches one 
side and not the other.

>
> I fail to see how the "name part of from addresses" are unlikely to match ham,
> for example, since I see quite a lot of spam apparently from myself.
>
>
> Antony.
>

Some spammers have Viagra in the name part. The name part is very 
spammy. I also store to and from email addresses so that relationships 
between people corresponding create a ham result. (I filter outbound as 
well for some people)

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:
> 
> > What percentage of emails using your algorithm are actually
> > decidable?
> 
> Almost 100% if you look at a wide variety of tokens from multiple
> attributes. Subject, body, content flags, header structure, combinations
> of all domains reference, php scripts, name part of from addresses,
> behavior flags.

I would have said that a very large number of the words used in spam mails are 
the same as the words used in ham mails, so I suspect I'm confused about what 
constitutes a "token".

I fail to see how the "name part of from addresses" are unlikely to match ham, 
for example, since I see quite a lot of spam apparently from myself.

Antony.

-- 
Never automate fully anything that does not have a manual override capability. 
Never design anything that cannot work under degraded conditions in emergency.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Shawn Bakhtiar <sh...@hotmail.com>.

> On Aug 22, 2016, at 8:09 AM, John Hardin <jh...@impsec.org> wrote:
> 
> On Mon, 22 Aug 2016, Antony Stone wrote:
> 
>> On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:
>> 
>>> On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:
>>>>> So.  What percentage of emails using your algorithm are actually
>>>>> decidable?
>>>> 
>>>> Almost 100% if you look at a wide variety of tokens from multiple
>>>> attributes.
>>> 
>>> I can't believe that, or I'm missing something.  Almost every spam I see
>>> contains words that also appear in ham.  Things like "this" or "invoice"
>>> or "regards" or "dear".
>>> 
>>> What am I missing?
>> 
>> I believe you're missing Marc's definition of "token".
> 
> ...and it looks like we're venturing into the "SA Bayes multiple-word token support" realm (as a surrogate).
> 

Even with the multiple tokens combined into one fingerprint, you've changed little. No matter how you bound the token, the assumption that there are no SPAM emails that contain HAM content, and vice versa, is false. 

Regardless, that is NOT what you claimed before; you seem to be flip-flopping between definitions to suit your argument.


> -- 
> John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
> jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>  USMC Rules of Gunfighting #6: If you can choose what to bring to a
>  gunfight, bring a long gun and a friend with a long gun.
> -----------------------------------------------------------------------
> 2 days until the 1937th anniversary of the destruction of Pompeii

Re: Matching infinite sets

Posted by John Hardin <jh...@impsec.org>.

On Mon, 22 Aug 2016, Antony Stone wrote:

> On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:
>
>> On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:
>>>> So.  What percentage of emails using your algorithm are actually
>>>> decidable?
>>>
>>> Almost 100% if you look at a wide variety of tokens from multiple
>>> attributes.
>>
>> I can't believe that, or I'm missing something.  Almost every spam I see
>> contains words that also appear in ham.  Things like "this" or "invoice"
>> or "regards" or "dear".
>>
>> What am I missing?
>
> I believe you're missing Marc's definition of "token".

...and it looks like we're venturing into the "SA Bayes multiple-word 
token support" realm (as a surrogate).

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #6: If you can choose what to bring to a
   gunfight, bring a long gun and a friend with a long gun.
-----------------------------------------------------------------------
  2 days until the 1937th anniversary of the destruction of Pompeii

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:

> On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:
> > > So.  What percentage of emails using your algorithm are actually
> > > decidable?
> > 
> > Almost 100% if you look at a wide variety of tokens from multiple
> > attributes.
> 
> I can't believe that, or I'm missing something.  Almost every spam I see
> contains words that also appear in ham.  Things like "this" or "invoice"
> or "regards" or "dear".
> 
> What am I missing?

I believe you're missing Marc's definition of "token".


Antony.

-- 
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 09:06:08 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> Hi Dianne, what you're missing are word combinations. Usually it's not
> a single word but a combination of words that trigger a result.

[snip]

So that's Bayes with multi-word tokens, throwing out tokens whose
probability is neither 0 nor 1.
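
As a rough illustration of that equivalence, here is a minimal Python sketch. The tiny corpora and the helper names (token_spam_probability, extreme_tokens) are made up for the example, and the probability is a plain frequency ratio, not SpamAssassin's actual Bayes computation.

```python
from collections import Counter

def token_spam_probability(spam_msgs, ham_msgs):
    """Observed P(spam | token) per token, as a plain frequency ratio."""
    spam_counts = Counter(tok for msg in spam_msgs for tok in set(msg))
    ham_counts = Counter(tok for msg in ham_msgs for tok in set(msg))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)   # fraction of spam containing tok
        h = ham_counts[tok] / len(ham_msgs)     # fraction of ham containing tok
        probs[tok] = s / (s + h)
    return probs

def extreme_tokens(probs):
    """Throw out every token whose observed probability is not exactly 0 or 1."""
    return {tok: p for tok, p in probs.items() if p in (0.0, 1.0)}

# Hypothetical multi-word tokens, one list per message.
spam = [["meet hot", "russian brides"], ["russian brides", "brides online"]]
ham = [["read an article", "russian brides"], ["in a magazine"]]

probs = token_spam_probability(spam, ham)
kept = extreme_tokens(probs)
spam_only = {t for t, p in kept.items() if p == 1.0}
ham_only = {t for t, p in kept.items() if p == 0.0}
```

With these inputs, the tokens that survive the filter are exactly the two set differences (spam minus ham, and ham minus spam); "russian brides" appears on both sides and is discarded.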

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.

On 08/22/16 07:45, Dianne Skoll wrote:
> On Mon, 22 Aug 2016 07:34:00 -0700
> Marc Perkel <su...@junkemailfilter.com> wrote:
>
>>> So.  What percentage of emails using your algorithm are actually
>>> decidable?
>> Almost 100% if you look at a wide variety of tokens from multiple
>> attributes.
> I can't believe that, or I'm missing something.  Almost every spam I see
> contains words that also appear in ham.  Things like "this" or "invoice"
> or "regards" or "dear".
>
> What am I missing?
>
>

Hi Dianne, what you're missing are word combinations. Usually it's not a 
single word but a combination of words that trigger a result.

      Example of how NOT matching works

Let's take 2 subject lines and see how this works.

Meet hot Russian Brides Online!
I read an article about Russian Brides in a magazine

A traditional spam filter using Bayesian or hard-coded rules about 
"Russian Brides" might determine that only 1 out of 500 emails 
mentioning the phrase "Russian Brides" is a good email. Thus the second 
line would have points assessed against it in the classification process 
using these traditional methods.

Using the Evolution Filter, the phrase "Russian Brides" is in both sets 
and therefore has no influence on the results. But the first subject 
matches these phrases in the Spam Only set.

Meet hot
Meet hot Russian
Meet hot Russian Brides
hot Russian Brides Online!
Russian Brides Online!
Brides Online!
Online!

The second subject matches these phrases in the Ham Only set, which are 
never used in the spam set.

I read an article
read an article
read an article about
about Russian
an article about
in a magazine
Brides in a

So even though the phrase "Russian Brides" has no influence, each subject 
hits either ham or spam many times where the same phrase was never used 
in the subject line in the opposite set. And the number of hits is 
significant enough just from these subjects to cause the fingerprints to 
be learned, and that's just looking at the Subject attribute. When this 
is combined with testing all attributes, the messages usually come out 
strongly on one side or the other.

In rule-based systems one would not normally build a whitelist rule to 
allocate points based on seeing the phrase "read an article about". 
That's where the Evolution Filter is different. It didn't need to have 
that rule because, since it is comparing to the infinite set of what is 
not matched on the other side, it dynamically creates billions of rules 
automatically.
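
The subject-line example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (word_ngrams, hits), the 4-word phrase limit, and the one-message "training sets" are hypothetical, and the real filter presumably learns from far more data and more attributes than the Subject.

```python
def word_ngrams(text, max_len=4):
    """Return every run of 1..max_len consecutive words in text."""
    words = text.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }

# Hypothetical training data: one spam subject and one ham subject.
spam_phrases = word_ngrams("Meet hot Russian Brides Online!")
ham_phrases = word_ngrams("I read an article about Russian Brides in a magazine")

# Phrases seen on both sides drop out; only one-sided phrases count.
spam_only = spam_phrases - ham_phrases
ham_only = ham_phrases - spam_phrases

def hits(subject):
    """Count phrases of the subject found in each one-sided set."""
    phrases = word_ngrams(subject)
    return len(phrases & spam_only), len(phrases & ham_only)
```

Here "Russian Brides" is in neither one-sided set, so a new subject is judged only by phrases such as "Meet hot" or "read an article" that were seen on one side alone.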

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> > So.  What percentage of emails using your algorithm are actually
> > decidable?

> Almost 100% if you look at a wide variety of tokens from multiple 
> attributes.

I can't believe that, or I'm missing something.  Almost every spam I see
contains words that also appear in ham.  Things like "this" or "invoice"
or "regards" or "dear".

What am I missing?

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.


On 08/22/16 08:58, RW wrote:
> On Mon, 22 Aug 2016 07:34:00 -0700
> Marc Perkel wrote:
>
>> On 08/22/16 07:28, Dianne Skoll wrote:
>>> The other two possibilities (no tokens in either or some tokens in
>>> both) are undecidable.
>> Exactly!
> In the past you've said that when there are tokens in both you compare
> the counts.

I do a very little bit of that. I make additional sets I call nearly-ham 
and nearly-spam where the ratio is very high, and count it as a half score.
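
One way to read the half-score idea, as a Python sketch. The function name token_score and the 20:1 cut-off are assumptions made for illustration; the message only says the ratio is "very high".

```python
def token_score(spam_count, ham_count, ratio=20):
    """Score one token; positive means spammy.

    +1 / -1 for tokens seen on one side only, +0.5 / -0.5 for tokens that
    are nearly one-sided. `ratio` is an assumed cut-off for "very high".
    """
    if spam_count > 0 and ham_count == 0:
        return 1.0     # spam-only
    if ham_count > 0 and spam_count == 0:
        return -1.0    # ham-only
    if spam_count > 0 and spam_count >= ratio * ham_count:
        return 0.5     # nearly-spam: counted as a half score
    if ham_count > 0 and ham_count >= ratio * spam_count:
        return -0.5    # nearly-ham: counted as a half score
    return 0.0         # mixed or never seen: no influence
```

Note that once a count ratio is used at all, the score is a coarse, three-level version of the per-token probability that a Bayesian filter would compute.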

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by RW <rw...@googlemail.com>.

On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:

> > The other two possibilities (no tokens in either or some tokens in
> > both) are undecidable.  
> 
> Exactly!

In the past you've said that when there are tokens in both you compare
the counts.


On Wed, 17 Aug 2016 11:02:38 -0700
Marc Perkel wrote:

>  Here's the actual formula.
> 
> card(Test_message intersect Spam diff Ham) minus card(Test_message
> intersect Ham diff Spam)
> 


On Wed, 20 Jan 2016 08:52:05 -0800
Marc Perkel wrote:

> Then you do a set
> diff both ways (ham - spam) (spam - ham) and whichever side is bigger
> wins. Generally it will match on only one side or very predominately
> on one side.
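
The quoted formula maps directly onto set operations. A minimal Python sketch, with a hypothetical function name (score) and made-up token sets:

```python
def score(test_tokens, spam_tokens, ham_tokens):
    """card(test & (spam - ham)) minus card(test & (ham - spam)).

    A positive result leans spam, a negative one leans ham, and zero is a tie.
    """
    spam_only = spam_tokens - ham_tokens
    ham_only = ham_tokens - spam_tokens
    return len(test_tokens & spam_only) - len(test_tokens & ham_only)

# Hypothetical learned token sets.
spam_tokens = {"meet hot", "brides online", "russian brides"}
ham_tokens = {"read an article", "in a magazine", "russian brides"}

spam_like = score({"meet hot", "russian brides"}, spam_tokens, ham_tokens)
ham_like = score({"read an article", "in a magazine"}, spam_tokens, ham_tokens)
tie = score({"meet hot", "in a magazine"}, spam_tokens, ham_tokens)
```

A message that hits both one-sided sets is still scored by this formula: the larger side wins, which is a comparison of counts rather than a strict "undecidable" outcome.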

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.


On 08/22/16 07:28, Dianne Skoll wrote:
> On Mon, 22 Aug 2016 07:16:41 -0700
> Marc Perkel <su...@junkemailfilter.com> wrote:
>
>> Antony, Yes - I don't store Set B. I store Set A. B is defined by
>> what's NOT in A. So I test A and if it's not matched it's set B. Set
>> B is just a negative match on A.
> Let me ask you a question.  As far as I understand your algorithm, if
> an email contains at least one token in the "ham" set and zero tokens in
> the "spam" set, you classify it as ham.  And conversely, if it contains
> at least one spam token but zero ham tokens, you classify it as spam.

YES! YES! YES!

Although I look at some thousand "fingerprints" to get a more 
significant result.

>
> The other two possibilities (no tokens in either or some tokens in both)
> are undecidable.

Exactly!

>
> So.  What percentage of emails using your algorithm are actually decidable?

Almost 100% if you look at a wide variety of tokens from multiple 
attributes. Subject, body, content flags, header structure, combinations 
of all domains reference, php scripts, name part of from addresses, 
behavior flags.

>
> Regards,
>
> Dianne.
>
>
>

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 07:16:41 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> Antony, Yes - I don't store Set B. I store Set A. B is defined by 
> what's NOT in A. So I test A and if it's not matched it's set B. Set
> B is just a negative match on A.

Let me ask you a question.  As far as I understand your algorithm, if
an email contains at least one token in the "ham" set and zero tokens in
the "spam" set, you classify it as ham.  And conversely, if it contains
at least one spam token but zero ham tokens, you classify it as spam.

The other two possibilities (no tokens in either or some tokens in both)
are undecidable.

So.  What percentage of emails using your algorithm are actually decidable?
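
The four cases, and the "decidable" percentage being asked about, can be written down as a short Python sketch. The names classify and decidable_fraction and the sample messages are hypothetical; this follows the strict reading in the question, not any particular implementation.

```python
def classify(tokens, spam_only, ham_only):
    """Apply the strict rule: one-sided hits decide, anything else does not."""
    has_spam = bool(tokens & spam_only)
    has_ham = bool(tokens & ham_only)
    if has_spam and not has_ham:
        return "spam"
    if has_ham and not has_spam:
        return "ham"
    return "undecidable"   # no hits at all, or hits on both sides

def decidable_fraction(messages, spam_only, ham_only):
    """Fraction of messages that the strict rule can classify."""
    decided = sum(
        classify(m, spam_only, ham_only) != "undecidable" for m in messages
    )
    return decided / len(messages)

# Hypothetical one-sided token sets and test messages.
spam_only = {"meet hot", "brides online"}
ham_only = {"read an article", "in a magazine"}
messages = [
    {"meet hot", "russian brides"},          # spam tokens only
    {"read an article", "russian brides"},   # ham tokens only
    {"russian brides"},                      # no one-sided tokens
    {"meet hot", "in a magazine"},           # tokens from both sides
]
labels = [classify(m, spam_only, ham_only) for m in messages]
fraction = decidable_fraction(messages, spam_only, ham_only)
```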

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.


On 08/22/16 06:55, Antony Stone wrote:
> On Monday 22 August 2016 at 15:46:41, Dianne Skoll wrote:
>
>> On Mon, 22 Aug 2016 06:04:49 -0700
>>
>> Marc Perkel <su...@junkemailfilter.com> wrote:
>>> Set A - a  finite set - has some members,
>>> Set B - an infinite set - is everything that is NOT in Set A
>> Set B is a very special case of an infinite set.  We're talking about
>> infinite sets in general.
>>
>> Also, you have to realize that although set B is in principle infinite,
>> in practice it is not.  Computers have finite memory, and although the
>> number of email tokens representable in the memory of a computer is very,
>> very, very large, it's not infinite.
> I do not think that Marc is proposing to actually store set B in a computer
> (or anywhere else).
>
> Set B is simply a theoretical construct, defined as the inverse of Set A, and
> to discover whether something is a member of it, you do not search through the
> infinite set B for a match, you instead check all members of finite set A for a
> non-match.
>
> If nothing in Set A matches X, then X is a member of Set B.
>
>
> Antony.
>

Antony, Yes - I don't store Set B. I store Set A. B is defined by 
what's NOT in A. So I test A and if it's not matched it's set B. Set B 
is just a negative match on A.

-- 
Marc Perkel - Sales/Support
support@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400

Re: Matching infinite sets

Posted by Antony Stone <An...@spamassassin.open.source.it>.

On Monday 22 August 2016 at 15:46:41, Dianne Skoll wrote:

> On Mon, 22 Aug 2016 06:04:49 -0700
> 
> Marc Perkel <su...@junkemailfilter.com> wrote:
> > Set A - a  finite set - has some members,
> > Set B - an infinite set - is everything that is NOT in Set A
> 
> Set B is a very special case of an infinite set.  We're talking about
> infinite sets in general.
> 
> Also, you have to realize that although set B is in principle infinite,
> in practice it is not.  Computers have finite memory, and although the
> number of email tokens representable in the memory of a computer is very,
> very, very large, it's not infinite.

I do not think that Marc is proposing to actually store set B in a computer 
(or anywhere else).

Set B is simply a theoretical construct, defined as the inverse of Set A, and 
to discover whether something is a member of it, you do not search through the 
infinite set B for a match, you instead check all members of finite set A for a 
non-match.

If nothing in Set A matches X, then X is a member of Set B.
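
In code, that amounts to storing only the finite set and negating the membership test. A minimal Python sketch, with a hypothetical class name (Complement):

```python
class Complement:
    """Everything NOT in a stored finite set; only the finite set is kept."""

    def __init__(self, excluded):
        self._excluded = frozenset(excluded)

    def __contains__(self, item):
        # Membership in B is decided by a failed lookup in A.
        return item not in self._excluded

set_a = {"apple", "orange"}
set_b = Complement(set_a)

orange_in_b = "orange" in set_b
cherry_in_b = "cherry" in set_b
```

The object never enumerates B, so memory use depends only on the size of A.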

Antony.

-- 
I have an excellent memory.
I can't think of a single thing I've forgotten.

                                                   Please reply to the list;
                                                         please *don't* CC me.

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 06:04:49 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> Set A - a  finite set - has some members,
> Set B - an infinite set - is everything that is NOT in Set A

Set B is a very special case of an infinite set.  We're talking about
infinite sets in general.

Also, you have to realize that although set B is in principle infinite,
in practice it is not.  Computers have finite memory, and although the
number of email tokens representable in the memory of a computer is very,
very, very large, it's not infinite.

Regards,

Dianne.

Re: Matching infinite sets

Posted by Marc Perkel <su...@junkemailfilter.com>.

I'm confused by the confusion here.

Set A - a  finite set - has some members,
Set B - an infinite set - is everything that is NOT in Set A

So you match a test item to Set A and if it matches it's a member of A. 
If it doesn't match Set A it's a member of B.

How is this not really simple?

Re: Matching infinite sets

Posted by Michael Orlitzky <mi...@orlitzky.com>.

On 08/22/2016 09:02 AM, Joe Quinn wrote:
> On 8/22/2016 8:54 AM, Michael Orlitzky wrote:
>> On 08/21/2016 03:22 PM, Damian wrote:
>>> There is no such set B, as it would contain itself.
>> The empty set contains itself.
> That's an easy mistake to make. The empty set is {}, the set that
> contains only the empty set is {{}}. Sets are discrete elements that
> don't get "flattened".
> 
> In perl syntactic lists do get flattened though, which leads to some fun
> times. You can do silly things like @concatenated = (@listOne, @listTwo).

"Contains" in the context of sets means "is a superset of" =)

(I'm just being pedantic, I don't actually have a point.)

Re: Matching infinite sets

Posted by Joe Quinn <jq...@pccc.com>.

On 8/22/2016 8:54 AM, Michael Orlitzky wrote:
> On 08/21/2016 03:22 PM, Damian wrote:
>> There is no such set B, as it would contain itself.
> The empty set contains itself.
That's an easy mistake to make. The empty set is {}, the set that 
contains only the empty set is {{}}. Sets are discrete elements that 
don't get "flattened".

In Perl, syntactic lists do get flattened though, which leads to some fun 
times. You can do silly things like @concatenated = (@listOne, @listTwo).
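
For anyone who wants to poke at the difference between {} and {{}}, here is a small Python sketch using frozenset (needed because Python set elements must be hashable). The variable names are arbitrary.

```python
empty = frozenset()                 # {}
set_of_empty = frozenset({empty})   # {{}}

# "Is an element of" and "is a subset of" are different questions.
empty_is_member_of_itself = empty in empty        # element-of test
empty_is_subset_of_itself = empty <= empty        # subset test
empty_is_member_of_wrapper = empty in set_of_empty
wrapper_size = len(set_of_empty)
```

The empty set has no elements, so it is not a member of itself, but it is a subset of every set, itself included.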

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Mon, 22 Aug 2016 08:54:48 -0400
Michael Orlitzky <mi...@orlitzky.com> wrote:

> The empty set contains itself.

No, it doesn't.  By definition.

Regards,

Dianne.

Re: Matching infinite sets

Posted by Michael Orlitzky <mi...@orlitzky.com>.

On 08/21/2016 03:22 PM, Damian wrote:
>>
> There is no such set B, as it would contain itself.

The empty set contains itself.

Re: Matching infinite sets

Posted by Damian <sp...@arcsin.de>.


Am 21.08.2016 um 18:47 schrieb Marc Perkel:
> Actually - you can match an infinite set. And maybe this is what it's
> hard for some people to wrap their head around.
>
> Suppose set A contains 2 items, apples and oranges.
> So we define set B as everything in the universe that is not in set A.
> So set B is an infinite set, everything in the universe EXCEPT apples
> and oranges.
>
There is no such set B, as it would contain itself.
> Our first test set contains an orange - so it matches set A and not set B.
> Our second test set contains a cherry - so it doesn't match set A but
> it does match set B.
>
> When you have a method that matches against infinite sets it
> completely changes how you think about spam and ham detection.
>
> On 08/16/16 12:57, Shawn Bakhtiar wrote:
>>
>> /
>> /
>> /By the way, you can't match an infinite set (well theoretically but
>> not actually). /
>> /https://en.wikipedia.org/wiki/Intersection_(set_theory)/
>> <https://en.wikipedia.org/wiki/Intersection_%28set_theory%29>
>> /
>> /
>>
>
> -- 
> Marc Perkel - Sales/Support
> support@junkemailfilter.com
> http://www.junkemailfilter.com
> Junk Email Filter dot com
> 415-992-3400

Re: Matching infinite sets

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Sun, 21 Aug 2016 09:47:45 -0700
Marc Perkel <su...@junkemailfilter.com> wrote:

> So we define set B as everything in the universe that is not in set A.

That's a very specific kind of infinite set.  It's the complement of a finite set.

Try this one on for size:

Consider the set A of all positive integral powers of pi (pi, pi^2, pi^3, etc.)
That's clearly infinite.

Set B is every element x of A such that the googolth digit (that is,
the 10^100th digit) after the decimal point of the decimal expansion
of x is 7.

Good luck matching B.  It's not even clear to me whether B is infinite
or finite, though I suspect it's infinite.

There are also sets with an uncountable infinity of elements, such as
the real numbers, for which "matching" has little meaning.

Regards,

Dianne.