Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2008/05/01 21:24:34 UTC

Re: [Bug 5896] New: try out enemieslist

Michael, could you paste that into the bug on Bugzilla? Your comments
will not make it in there otherwise.

--j.

Michael Peddemors writes:
> On Thursday 01 May 2008 04:03, bugzilla-daemon@bugzilla.spamassassin.org 
> wrote:
> > Steven Champeon has been in touch regarding 'testing my enemieslist rDNS
> > patterns data against the SpamAssassin spam/ham corpus(es) to see if
> > there's a reason for us to collaborate.'
> 
> > I'm curious to see how incorporating EL DNSBL lookups into SpamAssassin
> > might be useful; we have a DNSBL mirror network (currently three hosts,
> > with more on the way) or I can talk about how to use it with a patched
> > rbldnsd if you wanted to do some local testing. It'd be really
> 
> Actually, this is surprising that SA hasn't looked at something like this 
> already.. We also use a similar method in our Mail Server technologies, 
> albeit we do it in the SMTP layer.. but I think this begs a few questions..
> 
> o Should it be RBL based..
> 
> In the past SA users have been stung with RBL based lookups, when RBL's get 
> blocked etc.. leading to very high system loads..
> 
> o Should SA start integrating a definition update program for something like 
> this?
> 
> Compiling even 10k regex patterns takes very little overhead, and by doing 
> daily updates of a locally cached list there is little risk of problems even 
> when the updater fails, the latest regex's will always be on hand.
> 
> o Should this use one regex supplier, or community based?
> 
> This might be more helpful, since there are projects like Enemies List, our 
> own DynaRegex .. or other companies, projects etc.. that might evolve out of 
> this.
> 
> It also could have several different types of regex patterns, as mentioned 
> below  so that SA users could choose score settings for some patterns 
> differently than others..  Some patterns are safe enough to score very high, 
> while generic shared webhost patterns may want to be scored a little lower.
> 
> I think that the regex pattern database would be an excellent candidate for 
> building out an SA definition updater..
> 
> > OK, sounds good. I'm really interested in seeing what the various FP
> > rates would be for both the HELO and PTR for the various return values;
> > I'm also interested in seeing what rates are for the different
> > subclasses (as formed by the combination of A response and TXT response
> > for the same lookup, so "static/cable" or "dynamic/dsl" or
> > "natproxy/vpn"). Basically, I'm using these today as very blunt hammers,
> > and I want to make sure I have a good sense of how to better tune the
> > scoring. And you guys have such great stats, so I came to you :)
> >
> > > So, these are generally run against the SMTP connecting host's
> > > rDNS, right?
> >
> > Both PTR and HELO/EHLO string, yes. We've found that PTR is a good
> > indicator, but when the HELO string is a match for some EL pattern it's
> > a very reliable indicator of bot activity with a very low FP rate, so we
> > test both when available. Of course, this differs between the various
> > types, so I wouldn't assume webhost or outmx or static PTR are
> > necessarily bad, just indicative. But we'll see what the numbers
> > look like after we run some tests, I suppose :)
> >
> > > By the way, do you mind if we conduct this conversation on a public
> > > Bugzilla entry?  that's generally how we do it.  Doing that in the
> > > open is also more likely to get useful info on how other hosts
> > > have found the increased load from SpamAssassin lookups, too.
> >
> > No, not at all, though I definitely want to know how adding this to
> > SA would affect our load; and give me time to throw a few more rbldnsd
> > mirrors into the rotation if required. (Running lookups against the
> > patterns is very fast, 75K/s here on my macbook, but once you add
> > logging and DNS overhead it slows down considerably :-/)
> >
> > So, what next? Should we look at setting up a local rbldnsd instance
> > to isolate testing from our production machines? Was the doc I sent
> > a URL for in my last email sufficient to tweak whatever SA rules
> > you need to test? I'm here to answer any questions you have :)
> >
> >
> >
> > Anyway, usage details are here:  http://enemieslist.com/how/use.html --
> > we'd need to add some rules to do this.  I've been meaning to do this for
> > several weeks(!) but things have been busy :( so here's a new ticket.
> 
> -- 
> --
> "Catch the Magic of Linux..."
> ------------------------------------------------------------------------
> Michael Peddemors - President/CEO - LinuxMagic
> Products, Services, Support and Development
> Visit us at http://www.linuxmagic.com
> ------------------------------------------------------------------------
> A Wizard IT Company - For More Info http://www.wizard.ca
> "LinuxMagic" is a Registered TradeMark of Wizard Tower TechnoServices Ltd.
> ------------------------------------------------------------------------
> 604-589-0037 Beautiful British Columbia, Canada
> 
> This email and any electronic data contained are confidential and intended 
> solely for the use of the individual or entity to which they are addressed. 
> Please note that any views or opinions presented in this email are solely 
> those of the author and are not intended to  represent those of the company.
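
A minimal sketch of the per-class scoring idea raised in the quoted message, written as local SpamAssassin rules rather than DNSBL lookups. Everything specific here is an assumption made for illustration: the rule names, the regex patterns, the example.net hostnames, and the scores are hypothetical and are not taken from the enemieslist data or from DynaRegex. The sketch also assumes the X-Spam-Relays-External pseudo-header, whose first entry describes the connecting relay with rdns= and helo= fields.

    # Hypothetical pattern: dynamic-looking rDNS of the connecting host
    header   LOCAL_RDNS_DYNAMIC  X-Spam-Relays-External =~ /^[^\]]+ rdns=\S*(?:dsl|dhcp|dynamic)\S*\.example\.net /i
    describe LOCAL_RDNS_DYNAMIC  Connecting host rDNS matches a dynamic pattern
    score    LOCAL_RDNS_DYNAMIC  1.5

    # Same hypothetical pattern applied to the HELO string, scored higher
    # because a HELO match is described above as the more reliable indicator
    header   LOCAL_HELO_DYNAMIC  X-Spam-Relays-External =~ /^[^\]]+ helo=\S*(?:dsl|dhcp|dynamic)\S*\.example\.net /i
    describe LOCAL_HELO_DYNAMIC  Connecting host HELO matches a dynamic pattern
    score    LOCAL_HELO_DYNAMIC  2.5

    # Hypothetical shared-webhost pattern, scored lower
    header   LOCAL_RDNS_WEBHOST  X-Spam-Relays-External =~ /^[^\]]+ rdns=\S*\.web\.example\.net /i
    describe LOCAL_RDNS_WEBHOST  Connecting host rDNS matches a webhost pattern
    score    LOCAL_RDNS_WEBHOST  0.5

Keeping each pattern class in its own rule is what lets users tune the scores independently; the actual values would have to come from corpus tests like the ones discussed above.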

RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.
> -----Original Message-----
> From: Loren Wilton [mailto:lwilton@earthlink.net]
>
> Is there a way to clear the noautolearn for the whitelist rules?
> Normal
> rules could probably do it with tflags.  Except I'm not sure that you
> can
> necessarily negate a previously set tflags value with a later tflags
> value.
> (If not, maybe it would be worth an enhancement request.)

I tried that already. No changes. It seems I can't override the tflags of the USER_IN_WHITELIST rule/shortcircuit.
(I tried to override in local.cf)


> Another solution in this case would be to not use the whitelist.  Just
> make
> a rule, or several rules and meta them together, and give the overall
> rule a
> score of -100 and set the shortcircuit and autolearn flags on the rule.
> As
> everyone has mentioned, this can still end up poisoning your database
> if
> any of those senders get joe-jobbed.  But then again, you might be
> lucky and
> it would work.

Thanks but I think they convinced me.

Harry




----------------------------------------------------
JAM Software GmbH
Geschäftsführer: Joachim Marder
Bruchhausenstr. 1 * 54290 Trier * Germany
Tel: 0700-70707050 * Fax: 0700-70707059
(max. 12,4 ct/min, Preise aus Mobilfunknetzen können abweichen)
Handelsregister Nr. HRB 4920 (AG Wittlich)  http://www.jam-software.de

Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Loren Wilton <lw...@earthlink.net>.
Is there a way to clear the noautolearn for the whitelist rules?  Normal 
rules could probably do it with tflags.  Except I'm not sure that you can 
necessarily negate a previously set tflags value with a later tflags value. 
(If not, maybe it would be worth an enhancement request.)

Another solution in this case would be to not use the whitelist.  Just make 
a rule, or several rules and meta them together, and give the overall rule a 
score of -100 and set the shortcircuit and autolearn flags on the rule.  As 
everyone has mentioned, this can still end up poisoning your database if 
any of those senders get joe-jobbed.  But then again, you might be lucky and 
it would work.
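
A minimal sketch of what such a rule set might look like in local.cf. The rule names, the address, and the Received pattern are hypothetical; only the score of -100 comes from the suggestion above. It assumes the Shortcircuit plugin is loaded. No noautolearn tflag is set here, which is the point of the exercise and also what carries the poisoning risk mentioned above. Whether autolearning actually takes place after a shortcircuit may depend on the SpamAssassin version, so this would need testing.

    # Hypothetical sub-rules (names starting with __ are not scored)
    header   __LOCAL_FROM_PARTNER   From:addr =~ /^alice\@partner\.example$/i
    header   __LOCAL_RCVD_PARTNER   Received =~ /\bfrom mail\.partner\.example\b/i

    # Combine them and treat a hit as strongly ham
    meta     LOCAL_TRUSTED_PARTNER  (__LOCAL_FROM_PARTNER && __LOCAL_RCVD_PARTNER)
    describe LOCAL_TRUSTED_PARTNER  Known sender via its usual mail server
    score    LOCAL_TRUSTED_PARTNER  -100
    tflags   LOCAL_TRUSTED_PARTNER  nice
    shortcircuit LOCAL_TRUSTED_PARTNER on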

        Loren


new eval functions comparing the matches of two regular expression?

Posted by Harald Binkle <bi...@jam-software.com>.
What about new eval functions that compare the matches of two regular expressions?
If there were functions like

 eval:Equals(/regex1/,/regex2/)
and
 eval:NOTEquals(/regex1/,/regex2/)

it would be easy to define rules like:

a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address but the two have different 'name parts'.
Like:
TO: "User Name" <us...@domain.com>
FROM: "viagra offer" <us...@domain.com>

There is a lot of spam with that structure trying to get through, because many people have their own domain on the whitelist.
I tried to set this up as a rule, but with no luck. I fear it is not possible to do this with a regular expression,
as it is not possible to compare the matches of a regular expression within a regular expression.

Could someone implement this?
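
For the address-equality half of this, one possible workaround without a new eval function might be a backreference matched against the ALL pseudo-header, which exposes all of the message's headers as one string. The sketch below is untested; the rule names are hypothetical, the 0.8 score is the figure suggested above, and it checks neither that the 'name parts' differ nor that there is only one recipient.

    # Case 1: the From header appears before the To header
    header   __LOCAL_FROM_EQ_TO_A  ALL =~ /^From:[^\n]*<([^>\s]+)>.*?^To:[^\n]*<\1>/msi
    # Case 2: the To header appears before the From header
    header   __LOCAL_FROM_EQ_TO_B  ALL =~ /^To:[^\n]*<([^>\s]+)>.*?^From:[^\n]*<\1>/msi

    meta     LOCAL_FROM_EQ_TO      (__LOCAL_FROM_EQ_TO_A || __LOCAL_FROM_EQ_TO_B)
    describe LOCAL_FROM_EQ_TO      From and To carry the same address
    score    LOCAL_FROM_EQ_TO      0.8

The capture group grabs the address between angle brackets in one header, and \1 requires the same text in the other. Two sub-rules are used because header order is not fixed.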

Greetings

Harry



----------------------------------------------------
JAM Software GmbH
Geschäftsführer: Joachim Marder
Max-Planck-Str. 22 * 54296 Trier * Germany
Tel: 0700-70707050 * Fax: 0700-70707059
(max. 12,4 ct/min, Preise aus Mobilfunknetzen können abweichen)
Handelsregister Nr. HRB 4920 (AG Wittlich)  http://www.jam-software.de

RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.
Sidney,
thank you very much for your answers and explanations.
I just looked over the code of check_forged_in_whitelist and I think it is hard to use for my purpose.
I will wait a few days to see whether someone else replies to the request to implement eval:Equals(/regex1/,/regex2/) and eval:NOTEquals(/regex1/,/regex2/).
If no one answers, I'll post the request to the dev list again in one or two weeks with a more appropriate subject and see what others say.
The problem I have is that we use the Windows version of SpamAssassin (http://sourceforge.net/projects/sawin32/), so implementing a plugin that provides those two functions is not easy (a lot of work).

I think those evals would make it possible to write more powerful rules without having to implement small things in plugins, since it is not possible to compare the matches of a regular expression within the same regular expression.

Harry

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Wednesday, May 07, 2008 10:19 AM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 7:46 PM:
> > Sorry, I thought a discussion for switching the default behavior
> would be right to be
> > in dev list.
>
> Yes, I'm the one who brought up the related issues of how to handle
> learning and
> whitelisting, and I said what I did to make sure that any further
> digression to those
> topics should go to the users list. Your questions about changing the
> default behavior and
> about new eval rules would go in this list.
>
> > And what about a discussion about a new eval function comparing the
> matches of two
> > regular expression. If there would be functions
> eval:Equals(/regex1/,/regex2/) and
> > eval:NOTEquals(/regex1/,/regex2/)  it would be easy to define rules
> like the one I
> > mentioned in my last mail.
>
> I don't have an immediate opinion about this. Perhaps you could try it
> out in a plugin and
> see how it works out compared to simply using whitelist_from_rcvd to
> make the whitelisting
> work.
>
> I did once try to catch that kind of spam with an eval rule that calls
> check_forged_in_whitelist which is supposed to catch anything that
> matched the address
> portion of a whitelist_from_rcvd but doesn't match the received part of
> the test. I don't
> remember now why we don't have any rules that use that eval, it may be
> that it doesn't
> really work. You might try defining a rule
>
>    header FORGED_USER_IN_WHITELIST  eval:check_forged_in_whitelist()
>
> and also define some whitelist_from_rcvd entries and see if that rule
> has any success at
> catching those.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.
Harald Binkle wrote, On 7/5/08 7:46 PM:
> Sorry, I thought a discussion for switching the default behavior would be right to be
> in dev list.

Yes, I'm the one who brought up the related issues of how to handle learning and
whitelisting, and I said what I did to make sure that any further digression to those
topics should go to the users list. Your questions about changing the default behavior and
about new eval rules would go in this list.

> And what about a discussion about a new eval function comparing the matches of two
> regular expressions. If there would be functions eval:Equals(/regex1/,/regex2/) and
> eval:NOTEquals(/regex1/,/regex2/)  it would be easy to define rules like the one I
> mentioned in my last mail.

I don't have an immediate opinion about this. Perhaps you could try it out in a plugin and
see how it works out compared to simply using whitelist_from_rcvd to make the whitelisting
work.

I did once try to catch that kind of spam with an eval rule that calls
check_forged_in_whitelist which is supposed to catch anything that matched the address
portion of a whitelist_from_rcvd but doesn't match the received part of the test. I don't
remember now why we don't have any rules that use that eval, it may be that it doesn't 
really work. You might try defining a rule

   header FORGED_USER_IN_WHITELIST  eval:check_forged_in_whitelist()

and also define some whitelist_from_rcvd entries and see if that rule has any success at
catching those.

  -- sidney


RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.
I see.
Sorry, I thought a discussion about switching the default behavior would belong on the dev list.
And what about a discussion of a new eval function comparing the matches of two regular expressions?
If there were functions eval:Equals(/regex1/,/regex2/) and eval:NOTEquals(/regex1/,/regex2/), it would be easy to define rules like the one I mentioned in my last mail.

(Create a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address but the two have different 'name parts'.
Like:
TO: "User Name" <us...@domain.com>
FROM: "viagra offer" <us...@domain.com>

There is a lot of spam with that structure trying to get through, because many people have their own domain on the whitelist.
I tried to set this up as a rule, but with no luck. I fear it is not possible to do with a regular expression.)


Harry

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Wednesday, May 07, 2008 9:07 AM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 6:30 PM:
> > Hi, ok, these are good reasons, I see. But I wrote a script setting
> all recipients of
> > outgoing mails on the whitelist. So everyone I send a message to will
> be on the
> > whitelist. Meanwhile nearly all people I have contact to are on my
> whitelist so there
> > are almost no mails I receive which will be automatically learned as
> ham.
>
> Autolearn is a way of doing the best that you can with no work, but you
> are seeing some of
> its failings. There is really no substitute for a manual learning
> procedure where you find
> a way to make it easy to specify whether email is really typical ham or
> spam and send it
> to the learner, avoiding sending atypical ham that contains words that
> you would not want
> to learn as ham. I could get into a discussion about ideas on how to do
> that without
> having to classify all your mail by hand, which of course is what you
> use SpamAssassin to
> avoid in the first place, but that's the kind of discussion that the
> SpamAssassin users
> mailing list is for.
>
> > There are a lot of spam mails with that structure trying to get
> through because many
> > people have their own domain on the whitelist. I tried to set this up
> as rule but with
> no luck. I fear it is not possible to do with a regular expression.
>
> The proper way to do it is to use whitelist_from_rcvd instead of
> whitelist_from and put in
> a rule for each sending mail server that the person uses. Again, this
> is a topic for the
> sa-users mailing list rather than the dev list.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.
Harald Binkle wrote, On 7/5/08 6:30 PM:
> Hi, ok, these are good reasons, I see. But I wrote a script setting all recipients of
> outgoing mails on the whitelist. So everyone I send a message to will be on the
> whitelist. Meanwhile nearly all people I have contact to are on my whitelist so there
> are almost no mails I receive which will be automatically learned as ham.

Autolearn is a way of doing the best that you can with no work, but you are seeing some of 
its failings. There is really no substitute for a manual learning procedure where you find 
a way to make it easy to specify whether email is really typical ham or spam and send it 
to the learner, avoiding sending atypical ham that contains words that you would not want 
to learn as ham. I could get into a discussion about ideas on how to do that without 
having to classify all your mail by hand, which of course is what you use SpamAssassin to 
avoid in the first place, but that's the kind of discussion that the SpamAssassin users 
mailing list is for.

> There are a lot of spam mails with that structure trying to get through because many
> people have their own domain on the whitelist. I tried to set this up as rule but with
> no luck. I fear it is not possible to do with a regular expression.

The proper way to do it is to use whitelist_from_rcvd instead of whitelist_from and put in 
a rule for each sending mail server that the person uses. Again, this is a topic for the 
sa-users mailing list rather than the dev list.
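
As a sketch of the difference, with a hypothetical address and relay domain:

    # Matches on the From address alone, so a forged From is whitelisted too
    whitelist_from       alice@partner.example

    # Also requires the sending relay's reverse DNS to match the given domain
    whitelist_from_rcvd  alice@partner.example  partner.example

whitelist_from_rcvd depends on the Received headers being interpreted correctly, so trusted_networks generally has to be configured properly for it to work.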

  -- sidney


RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.
Hi,
ok, these are good reasons, I see.
But I wrote a script that puts all recipients of outgoing mail on the whitelist.
So everyone I send a message to will be on the whitelist.
By now nearly everyone I am in contact with is on my whitelist, so almost none of the mail I receive is automatically learned as ham.

Another thing regarding your answer, Matt:
Why not create a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address but the two have different 'name parts'?
Like:
TO: "User Name" <us...@domain.com>
FROM: "viagra offer" <us...@domain.com>

There is a lot of spam with that structure trying to get through, because many people have their own domain on the whitelist.
I tried to set this up as a rule, but with no luck. I fear it is not possible to do with a regular expression.


Harry


> -----Original Message-----
> From: Matt Kettler [mailto:mkettler_sa@verizon.net]
> Sent: Wednesday, May 07, 2008 7:19 AM
> To: Sidney Markowitz
> Cc: Harald Binkle; 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Sidney Markowitz wrote:
> > Harald Binkle wrote, On 7/5/08 1:33 AM:
> >> Hi, I just wondered why my bayes filter does not learn as much ham
> >> mails as before. Then I realized that the USER_IN_WHITELIST
> >> shortcircuit is set to spam which has tflags
> >> noautolearn. Does this really make sense?
> >
> > The rationale is that you put an address on the whitelist when they
> > might send mail that looks like spam but you know it is really ham.
> If
> > it looks like spam, you don't want the Bayes filter to learn that it
> > is ham, because from anyone else it would be spam.
>
> Another reason not to do so is the frequency with which people
> mis-configure their whitelists.
>
> If you mistakenly whitelist_from *@mydomain.com, as many people have
> done when first setting up SA, your Bayes database will be poisoned
> rather
> quickly if it allows such messages to autolearn.
>

&&&&&&&&&&&&&&&&&&&&

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Tuesday, May 06, 2008 10:41 PM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 1:33 AM:
> > Hi, I just wondered why my bayes filter does not learn as much ham
> mails as before.
> > Then I realized that the USER_IN_WHITELIST shortcircuit is set to
> spam which has tflags
> > noautolearn. Does this really make sense?
>
> The rationale is that you put an address on the whitelist when they
> might send mail that
> looks like spam but you know it is really ham. If it looks like spam,
> you don't want the
> Bayes filter to learn that it is ham, because from anyone else it would
> be spam.
>
> Of course, someone on your whitelist can also send mail that looks like
> ham. The Bayes
> filter can't learn anything one way or the other from that mail, so it
> is sent to noautolearn.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Matt Kettler <mk...@verizon.net>.
Sidney Markowitz wrote:
> Harald Binkle wrote, On 7/5/08 1:33 AM:
>> Hi, I just wondered why my bayes filter does not learn as much ham 
>> mails as before. Then I realized that the USER_IN_WHITELIST 
>> shortcircuit is set to spam which has tflags
>> noautolearn. Does this really make sense?
>
> The rationale is that you put an address on the whitelist when they 
> might send mail that looks like spam but you know it is really ham. If 
> it looks like spam, you don't want the Bayes filter to learn that it 
> is ham, because from anyone else it would be spam.

Another reason not to do so is the frequency with which people 
mis-configure their whitelists.

If you mistakenly whitelist_from *@mydomain.com, as many people have 
done when first setting up SA, your Bayes database will be poisoned rather 
quickly if it allows such messages to autolearn.



Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.
Harald Binkle wrote, On 7/5/08 1:33 AM:
> Hi, I just wondered why my bayes filter does not learn as much ham mails as before. 
> Then I realized that the USER_IN_WHITELIST shortcircuit is set to spam which has tflags
> noautolearn. Does this really make sense?

The rationale is that you put an address on the whitelist when they might send mail that 
looks like spam but you know it is really ham. If it looks like spam, you don't want the 
Bayes filter to learn that it is ham, because from anyone else it would be spam.

Of course, someone on your whitelist can also send mail that looks like ham. The Bayes 
filter can't learn anything one way or the other from that mail, so it is set to noautolearn.

  -- sidney

shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.
Hi,
I just wondered why my bayes filter does not learn as much ham mail as before.
Then I realized that the USER_IN_WHITELIST shortcircuit is set to spam, which has tflags noautolearn.
Does this really make sense?
The only case in which mail from a whitelisted user is not ham would be if the sender's machine is infected by a virus or a Trojan.
So why not change it back so that mail from users on the whitelist is learned by Bayes?

How can I change it back for myself so that mail from users on the whitelist is learned by Bayes?

Greetings

Harry


